Sri Satish Ambati, H2O.ai | CUBE Conversation, May 2020
>> Connecting with thought leaders all around the world, this is a CUBE Conversation. Hi everybody, this is Dave Vellante of theCUBE, and welcome back to my CXO series. I've been running this really since the start of the COVID-19 crisis to understand how leaders are dealing with this pandemic. Sri Ambati is here, he's the CEO and founder of H2O.ai. Sri, it's great to see you again, thanks for coming on. >> Thank you for having us. >> Yeah, so this pandemic has obviously given people fits, no question, but it's also given opportunities for companies to kind of reassess where they are. Automation is a huge watchword, flexibility, business resiliency, and people who maybe really hadn't fully leaned into things like the cloud and AI and automation are now realizing, wow, we have no choice, it's about survival. Your thoughts as to what you're seeing in the marketplace. >> Thanks for having us. I think first of all, kudos to the frontline health workers who have been relentlessly saving lives across the country and the world, and what we're really doing is a fraction of what we could have done or should be doing to stave off the next big pandemic. But that apart, I think, I usually tend to say BC is before COVID. So if the world was thinking about going digital after COVID-19, they have been forced to go digital, and as a result, you're seeing tremendous transformation across our customers, and a lot of appetite to kind of go in and reinvent their business models in ways that allow them to scale as effortlessly as they can using digital means. >> So, think about doctors and diagnosis: machines, in some cases, are helping doctors make diagnoses, they're sometimes making even better diagnoses, (mumbles) is informing. There's been a lot of talk about the models, you know how... Yeah, I know you've been working with a lot of healthcare organizations, you're probably familiar with, you know, the Medium post, The Hammer and the Dance, and people criticize the models, but of course, they're just models, right? And you iterate models, and machine intelligence can help us improve. So, in this, you know, you talk about BC and post-C, how have you seen the data and machine intelligence informing the models and improving what we know about this pandemic? I mean, it's changed literally daily, what are you seeing? >> Yeah, and I think it started with Wuhan, and we saw the best application of AI in trying to trace, literally from Alipay to WeChat, track down the first folks who were spreading it across China and then eventually the rest of the world. I think contact tracing, for example, has become a really interesting problem. Supply chain has been disrupted like never before. We're beginning to see customers trying to reinvent their distribution mechanisms in the second order effects of COVID, and the prime example is hospital staffing, how many ventilators, in the first few weeks of the COVID crisis as it evolved in the US.
We have been busy working with some of the local healthcare communities to predict how staffing in hospitals will work, how much PPE and how many ventilators will be needed, and so forth, and when the peak surge will be. Those were the early problems, and many of our customers have begun to build these models, iterate and improve, and kind of educate the community to practice social distancing, and that led to a lot of flattening of the curve, and when you're talking about flattening the curve, you're really talking about data science and analytics in public-speak. That led to kind of the next level: now that we have somewhat brought a semblance of order to the reaction to COVID, I think what we are beginning to figure out is, is there going to be a second surge, and which elective procedures that were postponed will be top of mind for customers, and so these are the kinds of things that hospitals are beginning to plan out for the second half of the year. And as businesses try to open up, certain things were highly correlated to a surge in cases, such as cleaning supplies, for example, the obvious one, or pantry buying. So retailers are beginning to see which online categories are doing well, e-commerce, online purchases, electronic goods. And everyone essentially started working from home, and so homes needed to have the same kind of bandwidth that offices and commercial enterprises needed to have, and so a lot of interesting shifts: on one side you saw airlines go away, on this side you saw the likes of Zoom and video take off. So you're kind of seeing a real digital divide happening, and AI is here to play a very good role in figuring out how to enhance your profitability as you're looking at planning out the next two years. >> Yeah, you know, and obviously, these things, they get partisan, it gets political, I mean, our job as an industry is to report, your job is to help people understand, I mean, let the data inform and then let public policy, you know, fight it out. So who are some of the people that you're working with, you know, as a result of COVID-19? What's some of the work that H2O has done? I want to better understand what role you are playing.
These are the places which are now busy places for the same kind of items that you need to sell if you're a retailer. But if you go one step further, we started engaging with FEMA, we started engaging with universities, like Imperial College London or Berkeley, and started figuring out how best to improve the models and automate them. The SEIR model, the most popular SEIR model, we added that into our Driverless AI product as a recipe and made that accessible to our customers in testing, to customers in healthcare who are trying to predict where the surge is likely to come. But it's mostly about information, right? So the AI at the end of it is all about intelligence and being prepared. Predictive is all about being prepared, and that's kind of what we did in general, lots of blogs, topical blog articles, and working with the largest health organizations and starting to kind of inform them on the most stable models. What we found, not so much to our surprise, is that the simplest, very interpretable models are actually the most widely usable, because historical data is actually no longer as effective. You need to build a model that you can quickly understand and retrain again through the feedback loop of back-testing that model against what really happened. >> Yeah, so I want to double down on that. So really, two things I want to understand, if you have visibility on it, sounds like you do. Just in terms of the surge and the comeback, you know, kind of what those models say, based upon, you know, we have some advance information coming from the global market, for sure, but it seems like every situation is different. What's the data telling you? Just in terms of, okay, we're coming into the spring and the summer months, maybe it'll come down a little bit. Everybody says it... We fully expect it to come back in the fall, go back to college, don't go back to college. What is the data telling you at this point in time, with an understanding that, you know, we're still iterating every day? >> Well, I think, I mean, we're not epidemiologists, but at the same time, the science of it is a highly local response, a very hyper-local response to COVID-19 is what we've seen. Santa Clara, which is just a county, I mean, is different from San Francisco, right, sort of. So you're beginning to see, like we saw in Brooklyn, it's very different, and the Bronx, very different from Manhattan. So you're seeing a very, very local response to this disease, and I'm talking about the US. You see the likes of Brazil, which we're worried about, has picked up quite a lot of cases now. I think the silver lining, I would say, is that China is up and running to a large degree, a large number of our user base there is back active, you can see the traffic patterns there. So two months after their last reported cases, the business and economic activity is back and thriving. And so, you can kind of estimate from that, that this can be done, where you can actually contain the rise of active cases, and it will take masking of the entire community, masking and a healthy dose of increased testing. One of our offices is in Prague, and the Czech Republic has done an incredible job in trying to contain this: they've essentially masked everybody, and as a result they're back thinking about opening offices and schools later this month.
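As a rough illustration of the SEIR compartmental model Sri mentions adding to Driverless AI as a recipe, here is a minimal, generic sketch; it is a textbook formulation, not H2O's actual recipe, and the population size and rate parameters are assumed purely for illustration.

```python
# Minimal SEIR (Susceptible-Exposed-Infectious-Recovered) model, solved as an ODE.
# Parameter values are assumed for illustration only; a real surge forecast would
# fit these rates to local case data.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, N):
    S, E, I, R = y
    dS = -beta * S * I / N              # new exposures
    dE = beta * S * I / N - sigma * E   # exposed become infectious
    dI = sigma * E - gamma * I          # infectious recover or are removed
    dR = gamma * I
    return dS, dE, dI, dR

N = 1_000_000                               # assumed population of a county
beta, sigma, gamma = 0.4, 1 / 5.2, 1 / 10   # assumed transmission, incubation, recovery rates
y0 = (N - 10, 10, 0, 0)                     # start with 10 exposed people
t = np.linspace(0, 180, 181)                # simulate 180 days

S, E, I, R = odeint(seir, y0, t, args=(beta, sigma, gamma, N)).T
print(f"Projected peak of {I.max():.0f} active infections around day {int(t[I.argmax()])}")
```

Scaling the infectious curve by an assumed hospitalization rate is one way such a model feeds the staffing, PPE, and ventilator estimates discussed above.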
So I think it's a very, very local response, a hyper-local response; no one country and no one community is symmetrical with the others, and I think we have a unique situation where in the United States you have a very, very highly connected world, a highly connected economy, and I think we have quite a problem on our hands in how to safeguard our economy while also safeguarding life. >> Yeah, so you can't just take Norway and apply it, or South Korea and apply it, every situation is different. And then I want to ask you about, you know, the economy in terms of, you know, how much can AI actually do, how can it work in this situation? Where you have, you know, for example, okay, so the Fed, yes, it started doing asset buys back in 2008, but still, very hard to predict, I mean, at the time of this interview, you know, the stock market is up 900 points, very difficult to predict that, but some event happens in the morning, somebody, you know, Powell says something positive and it goes crazy. But just sort of even modeling out the V recovery, the W recovery, deep recession, the comeback. You have to have enough data, do you not, in order for AI to be reasonably accurate? How does it work? And at what pace can you iterate and improve on the models? >> So I think that's exactly where, I would say, continuous modeling and continuous learning come in; that's where the vision of the world is headed, where data is coming in, you build a model, and then you iterate, try it out and come back. That kind of rapid, continuous learning would probably be needed for all our models, as opposed to the typical, I'm pushing a model to production once a year, or once every quarter. I think what we're beginning to see is where companies are beginning to kind of plan out. A lot of people lost their jobs in the last couple of months, right, sort of. And so upskilling and trying to kind of bring these jobs back, both from the manufacturing side, but also, we lost a lot of jobs in transportation and the kind of airlines slash hotel industries, right, sort of. So it's trying to now bring back that sense of confidence, and it will take a lot more kind of testing, a lot more masking, a lot more social empathy. I think some of the things that we are missing while we are socially distant, we know that we are so connected as a species, we need to kind of start having that empathy: we need to wear a mask, not for ourselves, but for our neighbors and the people we may run into. And I think that same kind of thinking has to kind of pervade before we can open up the economy in a big way. On the data, I mean, we can do a lot of transfer learning, right, sort of, there are new methods, like trying to model it similar to 1918, where we had a second bump, or a lot of little bumps, and that's kind of where your W-shaped pieces come in, but governments are trying very hard, and you're seeing stimulus dollars being pumped through banks. So some of the use cases we're looking at for banks are which small and medium businesses, especially in unsecured lending, which businesses to lend to, (mumbles) there are so many applications that have come to banks across the world, it's not just in the US, and banks are caught up with the problem of which of these businesses is really a going concern, and are they really accurate about the number of employees they are saying they have?
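The continuous modeling loop Sri describes, retraining as new data arrives and back-testing each model against what actually happened, can be sketched roughly as a walk-forward loop; the data file, column names, window sizes, and model choice below are assumptions for illustration, not H2O's implementation.

```python
# Rough sketch of a walk-forward "continuous modeling" loop: refit on a rolling
# window of recent data, predict the next horizon, and back-test each model
# against what really happened. File and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def walk_forward_backtest(df, features, target, window=60, horizon=7):
    errors = []
    for start in range(0, len(df) - window - horizon + 1, horizon):
        train = df.iloc[start:start + window]
        test = df.iloc[start + window:start + window + horizon]
        model = GradientBoostingRegressor().fit(train[features], train[target])
        preds = model.predict(test[features])
        errors.append(mean_absolute_error(test[target], preds))
    return pd.Series(errors, name="mae_per_retrain")

# Hypothetical usage with daily hospital-demand data:
# df = pd.read_csv("daily_admissions.csv")
# print(walk_forward_backtest(df, ["mobility_index", "positivity_rate"], "admissions"))
```

The point of the loop is the cadence: the model is refreshed every few days rather than once a quarter, and its error against reality is tracked each time.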
Then the next-level problems, around forbearance and mortgages, that side of things is coming up at some of these banks as well. So they're looking at, well, one of the problems that one of our customers, Wells Fargo, has is the question of which branches to open, right, sort of, and that itself needs a different kind of modeling. So everything has become very highly segmented models, and so AI is absolutely not just a nice-to-have, it has become a must-have for most of our customers in how to go about their business. (mumbles) >> I want to talk a little bit about your business. You have been on a mission to democratize AI since the beginning, open source. Explain your business model, how you guys make money, and then I want to help people understand basic theoretical comparisons and current affairs. >> Yeah, that's great. I think the last time we spoke was probably at the Spark Summit. I think, Dave, we were talking about Sparkling Water and H2O, our open source platforms, which are premier platforms for democratizing machine learning and math at scale, and that's been a tremendous brand for us. Over the last couple of years, we have essentially built a platform called Driverless AI, which is licensed software that automates machine learning: we took the best practices of all these data scientists and combined them to essentially build recipes that allow people to build the best forecasting models, the best fraud prevention models, or the best recommendation engines. And so we started augmenting traditional data scientists with this automatic machine learning, called AutoML, that essentially allows them to build models without necessarily having the same level of talent as these great Kaggle Grand Masters. And so that has democratized AI, allowed ordinary companies to start producing models of high caliber and high quality that would otherwise have been the purview of Google, Microsoft or Amazon, or some of these top-tier AI houses like Netflix and others. So what we've done is democratize not just the algorithms at the open source level. Now, we've made it easy for kind of rapid adoption of AI across every branch inside a company, a large organization, and also across smaller organizations which don't have access to the same kind of talent. Now, at a third level, you know, what we've brought to market is the ability to augment data sets, especially public and private data sets, the alternative data sets, that can increase the signal. And that's where we've started working on a new platform called Q, again, more licensed software. And, I mean, to give you an idea there from a business model standpoint, now the majority of our software sales is coming from closed source software. And sort of so, we've made that transition, we still make our open source widely accessible, we continue to improve it, a large chunk of the teams are improving and participating in building the communities, but I think from a business model standpoint, as of last year, 51% of our revenues are now coming from closed source software, and that is continuing to grow. >> And this is the point I wanted to get to, so you know, the open source model was, you know, Red Hat was the one company that, you know, succeeded wildly, and it was: put it out there open source, come up with a service, maintain the software, you've got to buy the subscription, okay, fine. And everybody thought that, you know, you were going to do that, they thought that Databricks was going to do that, and that changed.
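For readers who want a concrete feel for the AutoML idea Sri describes above, here is a minimal sketch using H2O's open source AutoML API; note this is the open source library, not the licensed Driverless AI product, and the file name and target column are assumptions for illustration.

```python
# Minimal sketch of automatic machine learning with H2O's open source AutoML.
# The CSV path and "defaulted" target column are hypothetical; Driverless AI
# (the licensed product discussed above) has its own separate interface.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("loan_applications.csv")        # hypothetical dataset
frame["defaulted"] = frame["defaulted"].asfactor()       # treat target as a class label
train, test = frame.split_frame(ratios=[0.8], seed=42)

aml = H2OAutoML(max_models=20, max_runtime_secs=600, seed=42)
aml.train(y="defaulted", training_frame=train)           # all other columns used as features

print(aml.leaderboard.head())            # candidate models ranked by cross-validated metric
predictions = aml.leader.predict(test)   # score held-out data with the best model
```

The design point is that the search over algorithms, feature handling, and hyperparameters is automated, which is what lets teams without Kaggle Grand Master-level talent produce competitive models.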
But I want to take two examples: Hortonworks, which kind of took the Red Hat model, and Cloudera, which does IP. And neither really lived up to the expectation, but now there seems to be sort of a new breed, I mentioned you guys, Databricks, there are others, that seem to be working. You with your licensed software model, Databricks with a managed service, and so it's becoming clear that there's got to be some level of IP that can be licensed in order to really thrive in the open source community, to be able to fund the committers that you have to put forth to open source. I wonder if you could give me your thoughts on that narrative. >> So on Driverless AI, which is the closed source platform I mentioned, we opened up the layers in open source as recipes. So for example, different companies handle their zip codes differently, right, the domain-specific recipes; we put about 150 of them in open source again, on top of our Driverless AI platform. And the idea there is that open source is about freedom, right? It is not necessarily about, it's not a philosophy, it's not a business model, it allows freedom for rapid adoption of a platform and complete democratization and commodification of a space. And that allows a small company like ours to compete at the level of a SAS or a Google or a Microsoft, because you have the same level of voice as a very large company, and you're focused on using code as a community building exercise as opposed to a business model, right? So that's kind of the heart of open source: it's allowing that freedom for our end users and the customers to kind of innovate at the same level as a Silicon Valley company or one of these large tech giants building software. So it's really about making, it's a maker culture, as opposed to a consumer culture, around software. Now, if you look at the Red Hat model, and the others who have tried to replicate that, the difficult part there was, if the product is very good, customers are self-sufficient, and if it becomes a standard, then customers know how to use it. If the product is crippled or difficult to use, then you put in a lot of services, and that's where you saw the classic Hadoop companies get pulled into a lot of services, which is a reasonably difficult business to scale. So I think what we chose instead was a great product that builds a fantastic brand that makes AI accessible, we were one of the first or second .ai domains, and for us to see thousands of companies which are now AI and AI-first, and even more companies adopting AI and talking about AI in a major way, that was possible because of open source. If we had chosen closed source, and many of our peers did, they all vanished. So that's kind of how open source is really about building the ecosystem and having the patience to build a company that takes 10, 20 years to build. And what is expected, unfortunately, is a fast rise up to become unicorns. In that race, you essentially sacrifice building a long ecosystem play, and that's kind of what we chose to do instead, and that took a little longer. Now, if you think about how you truly monetize open source, it takes a little longer and it's a much more difficult sales machine to scale, right, sort of. Our open source business actually is a reasonably positive-EBITDA business, because it makes more money than we spend on it. But trying to teach sales teams how to sell open source, that's a rate-limiting step.
And that's why we chose, and we were also explaining to the investors how open source gets invested in as you go closer to the IPO markets, that's where we chose: let's go into the licensed software model and scale that as a regular business. >> So I've said a few times, it's kind of ironic that this pandemic hits as we're entering a new decade. You know, we're exiting the era, I mean, the many, many decades of Moore's Law being the source of innovation, and now it's a combination of data, applying machine intelligence, and being able to scale with cloud. Well, my question is, what should we expect out of AI this decade, if those are sort of the three, the cocktail of innovation, if you will? Is it really just about, I suggest, is it really about automating, you know, businesses, giving them more agility, flexibility, you know, etc.? Or should we expect more from AI this decade? >> Well, I mean, if you think about the decade of the 2010s, that was defined by software is eating the world, right? And now you can say software is the world, right? I mean, pretty much almost all companies are digital. And AI is eating software, right? (mumbling) A lot of cloud transitions are happening, and are now happening at a much faster rate, but cloud and AI are kind of leading; AI is essentially one of the biggest drivers of cloud adoption for many of our customers. So in the enterprise world, you're seeing the rebuilding of a lot of fast, data-driven applications that use AI; instead of rule-based software, you're beginning to see pattern-based, machine-learning-driven software, and you're seeing that in spades. And, of course, that is just the tip of the iceberg. AI has been with us for 100 years, and it's going to be ahead of us another hundred years, right, sort of. It is really, fundamentally, a math movement, and a math movement at the beginning of a century leads to 100 years of phenomenal discovery. So AI is essentially making discoveries faster, AI is producing entertainment, AI is producing music, AI is doing choreography, you're seeing AI in every walk of life, AI summarization of Zoom meetings, right, and you're beginning to see a lot of AI-enabled ETF picking of stocks, right, sort of. You're beginning to see, we reprice 20,000 corporate bonds every 15 seconds using H2O AI. And one of our customers is among the fastest growing stocks; AI is powering a lot of these insights in a fast-changing world which is globally connected. None of us is able to combine all the multiple dimensions that are changing, and AI has that incredible opportunity to be a partner for every... (mumbling) For a hospital looking at how the second half will look, for physicians looking at what is the sentiment of... what is the surge to expect, to what is the market demand, looking at the sentiment of the customers. AI is the ultimate Moneyball in business, and I think it's just showing its depth at this point. >> Yeah, I mean, I think you're right on, I mean, basically AI is going to convert every piece of software, every application, or those tools aren't going to have much use. Sri, we've got to go, but thanks so much for coming on theCUBE and for the great work you guys are doing. Really appreciate your insights. Stay safe, and best of luck to you guys. >> Likewise, thank you so much. >> You're welcome, and thank you for watching, everybody, this is Dave Vellante for the CXO series on theCUBE.
We'll see you next time.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave | PERSON | 0.99+ |
2008 | DATE | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Wells Fargo | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
San Francisco | LOCATION | 0.99+ |
Prague | LOCATION | 0.99+ |
Brooklyn | LOCATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
51% | QUANTITY | 0.99+ |
May 2020 | DATE | 0.99+ |
China | LOCATION | 0.99+ |
United States | LOCATION | 0.99+ |
100 years | QUANTITY | 0.99+ |
Bronx | LOCATION | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
Manhattan | LOCATION | 0.99+ |
US | LOCATION | 0.99+ |
Santa Clara | LOCATION | 0.99+ |
last year | DATE | 0.99+ |
10% | QUANTITY | 0.99+ |
20,000 bonds | QUANTITY | 0.99+ |
Imperial College London | ORGANIZATION | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
One | QUANTITY | 0.99+ |
COVID-19 | OTHER | 0.99+ |
Los Angeles | LOCATION | 0.99+ |
Netflix | ORGANIZATION | 0.99+ |
H20 | ORGANIZATION | 0.99+ |
Red Hat | ORGANIZATION | 0.99+ |
South Korea | LOCATION | 0.99+ |
Sri Satish Ambati | PERSON | 0.99+ |
thousands | QUANTITY | 0.99+ |
FEMA | ORGANIZATION | 0.99+ |
Brazil | LOCATION | 0.99+ |
second half | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
second surge | QUANTITY | 0.99+ |
two months | QUANTITY | 0.99+ |
one | QUANTITY | 0.98+ |
second bump | QUANTITY | 0.98+ |
two things | QUANTITY | 0.98+ |
H2O | ORGANIZATION | 0.98+ |
both | QUANTITY | 0.98+ |
Czech Republic | LOCATION | 0.98+ |
Silicon Valley | LOCATION | 0.98+ |
TITLE | 0.98+ | |
three | QUANTITY | 0.98+ |
hundred years | QUANTITY | 0.98+ |
once a year | QUANTITY | 0.97+ |
Powell | PERSON | 0.97+ |
Sparkling Water | ORGANIZATION | 0.97+ |
Alipay | TITLE | 0.97+ |
Norway | LOCATION | 0.97+ |
pandemic | EVENT | 0.97+ |
second order | QUANTITY | 0.97+ |
third level | QUANTITY | 0.97+ |
first folks | QUANTITY | 0.97+ |
COVID-19 crisis | EVENT | 0.96+ |
Fed | ORGANIZATION | 0.95+ |
1918 | DATE | 0.95+ |
later this month | DATE | 0.95+ |
one side | QUANTITY | 0.94+ |
Sri Ambati | PERSON | 0.94+ |
two examples | QUANTITY | 0.93+ |
Moore | PERSON | 0.92+ |
Californians | PERSON | 0.92+ |
CXO | TITLE | 0.92+ |
last couple of months | DATE | 0.92+ |
COVID | OTHER | 0.91+ |
Spark Summit | EVENT | 0.91+ |
one step | QUANTITY | 0.91+ |
The Hammer | TITLE | 0.9+ |
COVID crisis | EVENT | 0.87+ |
every 15 seconds | QUANTITY | 0.86+ |
Joel Horwitz, IBM | IBM CDO Summit Spring 2018
(techno music) >> Announcer: Live, from downtown San Francisco, it's theCUBE. Covering IBM Chief Data Officer Strategy Summit 2018. Brought to you by IBM. >> Welcome back to San Francisco everybody, this is theCUBE, the leader in live tech coverage. We're here at the Parc 55 in San Francisco covering the IBM CDO Strategy Summit. I'm here with Joel Horwitz who's the Vice President of Digital Partnerships & Offerings at IBM. Good to see you again Joel. >> Thanks, great to be here, thanks for having me. >> So I was just, you're very welcome- It was just, let's see, was it last month, at Think? >> Yeah, it's hard to keep track, right. >> And we were talking about your new role- >> It's been a busy year. >> the importance of partnerships. One of the things I want to, well let's talk about your role, but I really want to get into, it's innovation. And we talked about this at Think, because it's so critical, in my opinion anyway, that you can attract partnerships, innovation partnerships, startups, established companies, et cetera. >> Joel: Yeah. >> To really help drive that innovation, it takes a team of people, IBM can't do it on its own. >> Yeah, I mean look, IBM is the leader in innovation, as we all know. We're the market leader for patents, that we put out each year, and how you get that technology in the hands of the real innovators, the developers, the longtail ISVs, our partners out there, that's the challenging part at times, and so what we've been up to is really looking at how we make it easier for partners to partner with IBM. How we make it easier for developers to work with IBM. So we have a number of areas that we've been adding, so for example, we've added a whole IBM Code portal, so if you go to developer.ibm.com/code you can actually see hundreds of code patterns that we've created to help really any client, any partner, get started using IBM's technology, and to innovate. >> Yeah, and that's critical, I mean you're right, because to me innovation is a combination of invention, which is what you guys do really, and then it's adoption, which is what your customers are all about. You come from the data science world. We're here at the Chief Data Officer Summit, what's the intersection between data science and CDOs? What are you seeing there? >> Yeah, so when I was here last, it was about two years ago in 2015, actually, maybe three years ago, man, time flies when you're having fun. >> Dave: Yeah, the Spark Summit- >> Yeah Spark Technology Center and the Spark Summit, and we were here, I was here at the Chief Data Officer Summit. And it was great, and at that time, I think a lot of the conversation was really not that different than what I'm seeing today. Which is, how do you manage all of your data assets? I think a big part of doing good data science, which is my kind of background, is really having a good understanding of what your data governance is, what your data catalog is, so, you know we introduced the Watson Studio at Think, and actually, what's nice about that, is it brings a lot of this together. So if you look in the market, in the data market, today, you know we used to segment it by a few things, like data gravity, data movement, data science, and data governance. And those are kind of the four themes that I continue to see. 
And so outside of IBM, I would contend that those are relatively separate kind of tools that are disconnected, in fact Dinesh Nirmal, who's our engineer on the analytic side, Head of Development there, he wrote a great blog just recently, about how you can have some great machine learning, you have some great data, but if you can't operationalize that, then really you can't put it to use. And so it's funny to me because we've been focused on this challenge, and IBM is making the right steps, in my, I'm obviously biased, but we're making some great strides toward unifying the, this tool chain. Which is data management, to data science, to operationalizing, you know, machine learning. So that's what we're starting to see with Watson Studio. >> Well, I always push Dinesh on this and like okay, you've got a collection of tools, but are you bringing those together? And he flat-out says no, we developed this, a lot of this from scratch. Yes, we bring in the best of the knowledge that we have there, but we're not trying to just cobble together a bunch of disparate tools with a UI layer. >> Right, right. >> It's really a fundamental foundation that you're trying to build. >> Well, what's really interesting about that, that piece, is that yeah, I think a lot of folks have cobbled together a UI layer, so we formed a partnership, coming back to the partnership view, with a company called Lightbend, who's based here in San Francisco, as well as in Europe, and the reason why we did that, wasn't just because of the fact that Reactive development, if you're not familiar with Reactive, it's essentially Scala, Akka, Play, this whole framework, that basically allows developers to write once, and it kind of scales up with demand. In fact, Verizon actually used our platform with Lightbend to launch the iPhone 10. And they show dramatic improvements. Now what's exciting about Lightbend, is the fact that application developers are developing with Reactive, but if you turn around, you'll also now be able to operationalize models with Reactive as well. Because it's basically a single platform to move between these two worlds. So what we've continued to see is data science kind of separate from the application world. Really kind of, AI and cloud as different universes. The reality is that for any enterprise, or any company, to really innovate, you have to find a way to bring those two worlds together, to get the most use out of it. >> Fourier always says "Data is the new development kit". He said this I think five or six years ago, and it's barely becoming true. You guys have tried to make an attempt, and have done a pretty good job, of trying to bring those worlds together in a single platform, what do you call it? The Watson Data Platform? >> Yeah, Watson Data Platform, now Watson Studio, and I think the other, so one side of it is, us trying to, not really trying, but us actually bringing together these disparate systems. I mean we are kind of a systems company, we're IT. But not only that, but bringing our trained algorithms, and our trained models to the developers. So for example, we also did a partnership with Unity, at the end of last year, that's now just reaching some pretty good growth, in terms of bringing the Watson SDK to game developers on the Unity platform. So again, it's this idea of bringing the game developer, the application developer, in closer contact with these trained models, and these trained algorithms. And that's where you're seeing incredible things happen. 
So for example, Star Trek Bridge Crew, which I don't know how many Trekkies we have here at the CDO Summit. >> A few over here probably. >> Yeah, a couple? They're using our SDK in Unity, to basically allow a gamer to use voice commands through the headset, through a VR headset, to talk to other players in the virtual game. So we're going to see more, I can't really disclose too much what we're doing there, but there's some cool stuff coming out of that partnership. >> Real immersive experience driving a lot of data. Now you're part of the Digital Business Group. I like the term digital business, because we talk about it all the time. Digital business, what's the difference between a digital business and a business? What's the, how they use data. >> Joel: Yeah. >> You're a data person, what does that mean? That you're part of the Digital Business Group? Is that an internal facing thing? An external facing thing? Both? >> It's really both. So our Chief Digital Officer, Bob Lord, he has a presentation that he'll give, where he starts out, and he goes, when I tell people I'm the Chief Digital Officer they usually think I just manage the website. You know, if I tell people I'm a Chief Data Officer, it means I manage our data, in governance over here. The reality is that I think these Chief Digital Officer, Chief Data Officer, they're really responsible for business transformation. And so, if you actually look at what we're doing, I think on both sides is we're using data, we're using marketing technology, martech, like Optimizely, like Segment, like some of these great partners of ours, to really look at how we can quickly A/B test, get user feedback, to look at how we actually test different offerings and market. And so really what we're doing is we're setting up a testing platform, to bring not only our traditional offers to market, like DB2, Mainframe, et cetera, but also bring new offers to market, like blockchain, and quantum, and others, and actually figure out how we get better product-market fit. What actually, one thing, one story that comes to mind, is if you've seen the movie Hidden Figures- >> Oh yeah. >> There's this scene where Kevin Costner, I know this is going to look not great for IBM, but I'm going to say it anyways, which is Kevin Costner has like a sledgehammer, and he's like trying to break down the wall to get the mainframe in the room. That's what it feels like sometimes, 'cause we create the best technology, but we forget sometimes about the last mile. You know like, we got to break down the wall. >> Where am I going to put it? >> You know, to get it in the room! So, honestly I think that's a lot of what we're doing. We're bridging that last mile, between these different audiences. So between developers, between ISVs, between commercial buyers. Like how do we actually make this technology, not just accessible to large enterprise, which are our main clients, but also to the other ecosystems, and other audiences out there. >> Well so that's interesting Joel, because as a potential partner of IBM, they want, obviously your go-to-market, your massive company, and great distribution channel. But at the same time, you want more than that. You know you want to have a closer, IBM always focuses on partnerships that have intrinsic value. So you talked about offerings, you talked about quantum, blockchain, off-camera talking about cloud containers. >> Joel: Yeah. 
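The A/B testing Joel describes with martech tools like Optimizely and Segment ultimately rests on a simple statistical question: is variant B's conversion rate genuinely better than variant A's, or is the gap just noise? Below is a small, standard-library-only sketch of that comparison; the visitor and conversion counts are made up, and real platforms layer sequential testing and guardrails on top of this.

```python
# Toy two-proportion z-test: is variant B's conversion rate really better
# than variant A's, or is the difference just noise? Numbers are invented.
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Variant A: 1,000 visitors, 48 conversions. Variant B: 1,000 visitors, 67.
z, p = two_proportion_z(48, 1000, 67, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # roughly z = 1.8, p = 0.07: not conclusive yet
```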
>> I'd say cloud and containers may be a little closer than those others, but those others are going to take a lot of market development. So what are the offerings that you guys are bringing? How do they get into the hands of your partners? >> I mean, the commonality with all of these, all the emerging offerings, if you ask me, is the distributed nature of the offering. So if you look at blockchain, it's a distributed ledger. It's a distributed transaction chain that's secure. If you look at data, really and we can hark back to say, Hadoop, right before object storage, it's distributed storage, so it's not just storing on your hard drive locally, it's storing on a distributed network of servers that are all over the world and data centers. If you look at cloud, and containers, what you're really doing is not running your application on an individual server that can go down. You're using containers because you want to distribute that application over a large network of servers, so that if one server goes down, you're not going to be hosed. And so I think the fundamental shift that you're seeing is this distributed nature, which in essence is cloud. So I think cloud is just kind of a synonym, in my opinion, for distributed nature of our business. >> That's interesting and that brings up, you're right, cloud and Big Data/Hadoop, we don't talk about Hadoop much anymore, but it kind of got it all started, with that notion of leave the data where it is. And it's the same thing with cloud. You can't just stuff your business into the public cloud. You got to bring the cloud to your data. >> Joel: That's right. >> But that brings up a whole new set of challenges, which obviously, you're in a position just to help solve. Performance, latency, physics come into play. >> Physics is a rough one. It's kind of hard to avoid that one. >> I hear your best people are working on it though. Some other partnerships that you want to sort of, elucidate. >> Yeah, no, I mean we have some really great, so I think the key kind of partnership, I would say area, that I would allude to is, one of the things, and you kind of referenced this, is a lot of our partners, big or small, want to work with our top clients. So they want to work with our top banking clients. They want, 'cause these are, if you look at for example, Maersk and what we're doing with them around blockchain, and frankly, talk about innovation, they're innovating containers for real, not virtual containers- >> And that's a joint venture right? >> Yeah, it is, and so it's exciting because, what we're bringing to market is, I also lead our startup programs, called the Global Entrepreneurship Program, and so what I'm focused on doing, and you'll probably see more to come this quarter, is how do we actually bridge that end-to-end? How do you, if you're a startup or a small business, ultimately reach that kind of global business partner level? And so kind of bridging that, that end-to-end. So we're starting to bring out a number of different incentives for partners, like co-marketing, so I'll help startups when they're early, figure out product-market fit. We'll give you free credits to use our innovative technology, and we'll also bring you into a number of clients, to basically help you not burn all of your cash on creating your own marketing channel. God knows I did that when I was at a start-up. So I think we're doing a lot to kind of bridge that end-to-end, and help any partner kind of come in, and then grow with IBM. I think that's where we're headed.
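Joel's phrase "a distributed transaction chain that's secure" is the part of blockchain that is easy to show in a few lines: each block commits to the hash of the block before it, so rewriting history silently breaks verification. The toy sketch below shows only that chaining idea; it is not Hyperledger or IBM Blockchain code, and it leaves out consensus, peers, and everything that makes a real ledger distributed.

```python
# Toy hash chain: each block commits to the previous block's hash, so
# tampering with any past entry breaks verification for everything after it.
# Illustrative only; no consensus, no peers, no real distribution.
import hashlib
import json

def block_hash(body):
    payload = json.dumps(body, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "prev_hash": prev_hash, "data": data}
    block["hash"] = block_hash({k: v for k, v in block.items() if k != "hash"})
    chain.append(block)
    return chain

def verify(chain):
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        if block["hash"] != block_hash(body):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, {"shipment": "container-123", "event": "loaded"})
append_block(chain, {"shipment": "container-123", "event": "departed"})
print(verify(chain))             # True
chain[0]["data"]["event"] = "x"  # tamper with history
print(verify(chain))             # False
```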
>> I think that's a critical part of your job. Because I mean, obviously IBM is known for its Global 2000, big enterprise presence, but startups, again, fuel that innovation fire. So being able to attract them, which you're proving you can, providing whatever it is, access, early access to cloud services, or like you say, these other offerings that you're producing, in addition to that go-to-market, 'cause it's funny, we always talk about how efficient, capital efficient, software is, but then you have these companies raising hundreds of millions of dollars, why? Because they got to do promotion, marketing, sales, you know, go-to-market. >> Yeah, it's really expensive. I mean, you look at most startups, like their biggest ticket item is usually marketing and sales. And building channels, and so yeah, if you're, you know we're talking to a number of partners who want to work with us because of the fact that, it's not just like, the direct kind of channel, it's also, as you kind of mentioned, there's other challenges that you have to overcome when you're working with a larger company. for example, security is a big one, GDPR compliance now, is a big one, and just making sure that things don't fall over, is a big one. And so a lot of partners work with us because ultimately, a number of the decision makers in these larger enterprises are going, well, I trust IBM, and if IBM says you're good, then I believe you. And so that's where we're kind of starting to pull partners in, and pull an ecosystem towards us. Because of the fact that we can take them through that level of certification. So we have a number of free online courses. So if you go to partners, excuse me, ibm.com/partners/learn there's a number of blockchain courses that you can learn today, and will actually give you a digital certificate, that's actually certified on our own blockchain, which we're actually a first of a kind to do that, which I think is pretty slick, and it's accredited at some of the universities. So I think that's where people are looking to IBM, and other leaders in this industry, is to help them become experts in their, in this technology, and especially in this emerging technology. >> I love that blockchain actually, because it's such a growing, and interesting, and innovative field. But it needs players like IBM, that can bring credibility, enterprise-grade, whether it's security, or just, as I say, credibility. 'Cause you know, this is, so much of negative connotations associated with blockchain and crypto, but companies like IBM coming to the table, enterprise companies, and building that ecosystem out is in my view, crucial. >> Yeah, no, it takes a village. I mean, there's a lot of folks, I mean that's a big reason why I came to IBM, three, four years ago, was because when I was in start-up land, I used to work for H20, I worked for Alpine Data Labs, Datameer, back in the Hadoop days, and what I realized was that, it's an opportunity cost. So you can't really drive true global innovation, transformation, in some of these bigger companies because there's only so much that you can really kind of bite off. And so you know at IBM it's been a really rewarding experience because we have done things like for example, we partnered with Girls Who Code, Treehouse, Udacity. So there's a number of early educators that we've partnered with, to bring code to, to bring technology to, that frankly, would never have access to some of this stuff. 
Some of this technology, if we didn't form these alliances, and if we didn't join these partnerships. So I'm very excited about the future of IBM, and I'm very excited about the future of what our partners are doing with IBM, because, geez, you know the cloud, and everything that we're doing to make this accessible, is bar none, I mean, it's great. >> I can tell you're excited. You know, spring in your step. Always a lot of energy Joel, really appreciate you coming onto theCUBE. >> Joel: My pleasure. >> Great to see you again. >> Yeah, thanks Dave. >> You're welcome. Alright keep it right there, everybody. We'll be back. We're at the IBM CDO Strategy Summit in San Francisco. You're watching theCUBE. (techno music) (touch-tone phone beeps)
Day One Kickoff | BigData NYC 2017
(busy music) >> Announcer: Live from Midtown Manhattan, it's the Cube, covering Big Data New York City 2017, brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Hello, and welcome to the special Cube presentation here in New York City for Big Data NYC, in conjunction with all the activity going on with Strata, Hadoop, Strata Data Conference right around the corner. This is the Cube's special annual event in New York City where we highlight all the trends, technology experts, thought leaders, entrepreneurs here inside the Cube. We have our three days of wall to wall coverage, evening event on Wednesday. I'm John Furrier, the co-host of the Cube, with Jim Kobielus, and Peter Burris will be here all week as well. Kicking off day one, Jim, the monster week of Big Data NYC, which now has turned into, essentially, the big data industry is a huge industry. But now, subsumed within a larger industry of AI, IoT, security. A lot of things have just sucked up the big data world that used to be the Hadoop world, and it just kept on disrupting, and creative disruption of the old guard data warehouse market, which now, looks pale in comparison to the disruption going on right now. >> The data warehouse market is very much vibrant and alive, as is the big data market continuing to innovate. But the innovations, John, have moved up the stack to artificial intelligence and deep learning, as you've indicated, driving more of the Edge applications in the new generation of mobile and smart appliances and things that are coming along like smart, self-driving vehicles and so forth. What we see is data professionals and developers are moving towards new frameworks, like TensorFlow and so forth, for development of the truly disruptive applications. But big data is the foundation. >> I mean, the developers are the key, obviously, open source is growing at an enormous rate. We just had the Linux Foundation, we now have the Open Source Summit, they have kind of rebranded that. We're going to see an explosion of code, from 64 million lines of code to billions of lines of code, exponential growth. But the bigger picture is that it's not just developers, it's the enterprises now who want hybrid cloud, they want cloud technology. I want to get your reaction to a couple of different threads. One is the notion of community based software, which is open source, extending into the enterprise. We're seeing things like blockchain is hot right now, security, two emerging areas that are overlapping in with big data. You obviously have classic data market, and then you've got AI. All these things kind of come in together, kind of just really putting at the center of all that, this core industry around community and software AI, particular. It's not just about machine learning anymore and data, it's a bigger picture. >> Yeah, in terms of a community, development with open source, much of what we see in the AI arena, for example, with the up and coming, they're all open source tools. There's TensorFlow, there's Caffe, there's Theano and so forth. What we're seeing is not just the frameworks for developing AI that are important, but the entire ecosystem of community based development of capabilities to automate the acquisition of training data, which is so critically important for tuning AI, for its designated purpose, be it doing predictions and abstractions. DevOps, what are coming into being are DevOps frameworks to span the entire life cycle of the creation and the training and deployment and iteration of AI.
What we're going to see is, like at the last Spark Summit, there was a very interesting discussion from a Stanford researcher, new open source tools that they're developing out in, actually, in Berkeley, I understand, for, related to development of training data in a more automated fashion for these new challenges. The communities are evolving up the stack to address these requirements with fairly bleeding edge capabilities that will come in the next few years into the mainstream. >> I had a chat with a big time CTO last night, he worked at some of the big web scale companies, I won't say the name, it'd give it away. But basically, he asked me a question about IoT, how real is it, and obviously, it's hyped up big time, though. But the issue in all these new markets like IoT and AI is the role of security, because a lot of enterprises are looking at the IoT, certainly in the industrial side has the most relevant low hanging fruit, but at the end of the day, the data modeling, as you're pointing out, becomes a critical thing. Connecting IoT devices to, say, an IP network sounds trivial in concept, but at the end of the day, the surface area for security is so exposed, that's causing people to stop what they're doing, not deploying it as fast. You're seeing kind of like people retrenching and replatforming at the core data centers, and then leveraging a lot of cloud, which is why Azure is hot, Microsoft Ignite Event is pretty hot this week. Role of cloud, role of data in IoT. Is IoT kind of stalled in your mind? Or is it bloating? >> I wouldn't say it's stalled or that it's bloating, but IoT is definitely coming along as the new development focus. For the more disruptive applications that can derive more intelligence directly to the end points that can take varying degrees of automated action to achieve results, but also to very much drive decision support in real time to people on their mobiles or in whatever. What I'm getting at is that IoT is definitely a reality in the real world in terms of our lives. It's definitely a reality in terms of the next generation of data applications. But there's a lot of the back end in terms of readying algorithms and in training data for deployment of really high quality IoT applications, Edge applications, that hasn't come together yet in any coherent practice. >> It's emerging, it's emerging. >> It's emerging. >> It's a lot more work to do. OK, we're going to kick off day one, we've got some great guests, we see Rob Bearden in the house, Rob Thomas from IBM. >> Rob Bearden from Hortonworks. >> Rob Bearden from Hortonworks, and Rob Thomas from IBM. I want to bring up, Rob wrote a book just recently. He wrote Big Data Revolution, but he also wrote a new book called, Every Company is a Tech Company. But he mentions, he kind of teases out this concept of a renaissance, so I want to get your thoughts on this. If you look at Strata, Hadoop, Strata Data, the O'Reilly Conference, which has turned into like a marketing machine, right. A lot of hype there. But as the community model grows up, you're starting to see a renaissance of real creative developers, you're starting to see, not just open source, pure, full stack developers doing all the heavy lifting, but real creative competition, in a renaissance, that's really the key. You're seeing a lot more developer action, tons outside of the, what was classically called the data space. The role of data and how it relates to the developer phenomenon that's going on right now. >> Yeah, it's the maker culture.
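The "more automated" development of training data Jim mentioned a moment ago, the Stanford and Berkeley line of work he is alluding to, is in the spirit of weak-supervision tools such as Snorkel: cheap, noisy labeling functions vote on each example, and the combined vote becomes a provisional label a model can be trained against. The sketch below is a toy version of that pattern, not any particular tool; the rules and example texts are invented.

```python
# Toy weak supervision: several cheap "labeling functions" vote on each
# example, and the majority becomes a provisional (noisy) training label.
# Rules and texts are invented for illustration.
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_contains_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_contains_refund(text):
    return NEGATIVE if "refund" in text.lower() else ABSTAIN

def lf_exclamation(text):
    return POSITIVE if text.endswith("!") else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_great, lf_contains_refund, lf_exclamation]

def weak_label(text):
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

reviews = ["Great product, works as advertised!",
           "Asking for a refund, arrived broken.",
           "It is fine I guess."]
print([(r, weak_label(r)) for r in reviews])
# Positive, negative, and abstain respectively; abstentions get routed
# to humans or dropped, which is where the community labeling comes in.
```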
Rob, in fact, about a year or more ago, IBM, at one of their events, they held a very maker oriented event, I think they called it Datapalooza at one point. What it's looking at, what's going on is it's more than just classic software developers are coming to the fore. When you're looking at IoT or Edge applications, it's hardware developers, it's UX developers, it's developers and designers who are trying to change and drive data driven applications into changing the very fabric of how things are done in the real world. It's what Peter Burris calls Programming in the Real World, and we have a Wikibon piece on it. What that all involves is there's a new set of skill sets that are coming together to develop these applications. It's well beyond just simply software development, it's well beyond simply data scientists. Maker culture. >> Programming in the real world is a great concept, because you need real time, which comes back down to this. I'm looking for this week from the guests we talked to, what their view is of the data market right now. Because if you want to get real time, you've got to move from that batch world to the real time world. I'm not saying batch is over, you've still got to store data, and that's growing at an exponential rate as well. But real time data, how do you use data in real time, how does the modeling work, how do you scale that. How do you take a DevOps culture to the data world is what I'm looking for. What are you looking for this week? >> What I'm looking for this week, I'm looking for DevOps solutions or platforms or environments for teams of data scientists who are building and training and deploying and evaluating, iterating deep learning and machine learning and natural language processing applications in a continuous release pipeline, and productionizing them. At Wikibon, we are going deeper in that whole notion of DevOps for data science. I mean, IBM's called it inside ops, others call it data ops. What we're seeing across the board is that more and more of our customers are focusing on how do we bring it all together, so the maker culture. >> Operationalizing it. >> Operationalizing it, so that the maker cultures that they have inside their value chain can come together and there's a standard pattern workflow of putting this stuff out and productionizing it, AI productionized in the real world. >> Moving in from the proof of concept notion to actually just getting things done, putting it out in the network, and then bringing it to the masses with operational support. >> Right, like the good folks at IBM with Watson data platform, on some levels, is a DevOps for data science platform, but it's a collaborative environment. That's what I'm looking to see, and there's a lot of other solution providers who are going down that road. >> I mean, to me, if people have the community traction, that is the new benchmark, in my opinion. You heard it here on the Cube. Community continues to scale, you can start seeing it moving out of open source, you're seeing things like blockchain, you're seeing a decentralized Internet now happening everywhere, not just distributed but decentralized. When you have decentralization, community and software really shine. It's the Cube here in New York City all week. Stay with us for wall to wall coverage through Thursday here in New York City for Big Data NYC, in conjunction with Strata Data, this is the Cube, we'll be back with more coverage after this short break.
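One concrete piece of the "DevOps for data science" release pipeline Jim describes is a promotion gate: a challenger model only replaces the production champion if it beats it on held-out data by a meaningful margin. The sketch below illustrates that check generically with scikit-learn; it is not Watson Data Platform functionality, and the data, models, and threshold are placeholders.

```python
# Toy champion/challenger gate for a model release pipeline: promote the
# new model only if it beats the current production model on holdout data.
# Generic sketch; thresholds, data, and models are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

champion = LogisticRegression(max_iter=5000).fit(X_train, y_train)
challenger = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

def holdout_auc(model):
    return roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])

MIN_IMPROVEMENT = 0.005  # don't churn production for noise-level gains
champ_auc, chall_auc = holdout_auc(champion), holdout_auc(challenger)
if chall_auc >= champ_auc + MIN_IMPROVEMENT:
    print(f"Promote challenger: AUC {chall_auc:.3f} vs {champ_auc:.3f}")
else:
    print(f"Keep champion: AUC {champ_auc:.3f} vs challenger {chall_auc:.3f}")
```

In a continuous release pipeline this check would run automatically on every candidate, the same way unit tests gate an application build.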
(busy music) (serious electronic music) (peaceful music) >> Hi, I'm John Furrier, the Co-founder of SiliconANGLE Media, and Co-host of the Cube. I've been in the tech business since I was 19, first programming on minicomputers in a large enterprise, and then worked at IBM and Hewlett Packard, a total of nine years in the enterprise, various jobs from programming, training, consulting, and ultimately, as an executive sales person, and then started my first company in 1997, and moved to Silicon Valley in 1999. I've been here ever since. I've always loved technology, and I love covering emerging technology. I was trained as a software developer and love business. I love the impact of software and technology to business. To me, creating technology that starts a company and creates value and jobs is probably one of the most rewarding things I've ever been involved in. I bring that energy to the Cube, because the Cube is where all the ideas are, and where the experts are, where the people are. I think what's most exciting about the Cube is that we get to talk to people who are making things happen, entrepreneurs, CEOs of companies, venture capitalists, people who are really, on a day in and day out basis, building great companies. In the technology business, there's just not a lot of real time live TV coverage, and the Cube is a non-linear TV operation. We do everything that the TV guys on cable don't do. We do longer interviews, we ask tougher questions. We ask, sometimes, some light questions. We talk about the person and what they feel about. It's not prompted and scripted, it's a conversation, it's authentic. For shows that have the Cube coverage, it makes the show buzz, it creates excitement. More importantly, it creates great content, great digital assets that can be shared instantaneously to the world. Over 31 million people have viewed the Cube, and that is the result of great content, great conversations. I'm so proud to be part of the Cube with a great team. Hi, I'm John Furrier, thanks for watching the Cube. >> Announcer: Coming up on the Cube, Jagane Sundar, CTO of WANdisco. Live Cube coverage from Big Data NYC 2017 continues in a moment. >> Announcer: Coming up on the Cube, Donna Prlich, Chief Product Officer at Pentaho. Live Cube coverage from Big Data New York City 2017 continues in a moment. >> Announcer: Coming up on the Cube, Amit Walia, Executive Vice President and Chief Product Officer at Informatica. Live Cube coverage from Big Data New York City continues in a moment. >> Announcer: Coming up on the Cube, Prakash Nanduri, Co-founder and CEO of Paxata. Live Cube coverage from Big Data New York City continues in a moment. (serious electronic music)
Day One Wrap | BigData NYC 2017
>> Announcer: Live from midtown Manhattan, it's theCUBE covering BigData New York City 2017. Brought to you by SiliconANGLE Media, and its ecosystem sponsors. >> Hello everyone, welcome back to our day one, at Big Data NYC, of three days of wall to wall coverage. This is theCUBE. I'm John Furrier, with my co-hosts Jim Kobielus and Peter Burris. We do this event every year, this is theCUBE's BigData NYC. It's our event that we run in New York City. We have a lot of great content, we have theCUBE going live, we don't go to Strata anymore. We do our own event in conjunction, they have their own event. You can go pay over there and get the booth space, but we do our media event and attract all the influencers, the VIPs, the executives, the entrepreneurs, we've been doing it for five years, we're super excited, and thank our sponsors for allowing us to get here and really appreciate the community for continuing to support theCUBE. We're here to wrap up day one, what's going on in New York, certainly we've had a chance to check out the Strata situations, Strata Data, which is Cloudera, and O'Reilly, mainly O'Reilly media, they run that, kind of old school event, guys. Let's kind of discuss the impact of the event in context to the massive growth that's going outside of their event. And their event is a walled garden, you got to pay to get in, they're very strict. They don't really let a lot of people in, but, okay. Outside of that the event is going global, the activity around big data is going global. It's more than Hadoop, we certainly thought about that's old news, but what's the big trend this year? As the horizontally scalable cloud enters the equation. >> I think the big trend, John, is, and we've talked about in our research, is that we have finally moved away from big data, being associated with a new type of infrastructure. The emergence of AI, deep learning, machine learning, cognitive, all these different names for relatively common things, are an indication that we're starting to move up into people thinking about applications, people thinking about services they can use to get access, or they can get access to build their applications. There's not enough skills. So I think that's probably the biggest thing is that the days of failure being measured by whether or not you can scale your cluster up, are finally behind us. We're using the cloud, other resources, we have enough expertise, the technologies are becoming simpler and more straightforward to do that. And now we're thinking about how we're going to create value out of all of this, which is how we're going to use the data to learn something new about what we're doing in the organization, combine it with advanced software technologies that actually dramatically reduce the amount of work that's necessary to make a decision. >> And the other trend I would say, on top of that, just to kind of put a little cherry on top of that, kind of the business focus which is again, not the speeds and feeds, although under the hood, lot of great innovation going on from deep learning, and there's a ton of stuff. However, the conversation is the business value, how it's transforming work and, but the one thing that nobody's talking about is, this is why I'm not bullish on these one shows, one show meets all kind of thing like O'Reilly Media does, because there's multiple personas in a company now in the ecosystem. There are now a variety of buyers of some products. At least in the old days, you'd go talk to the IT CIO and you're in.
Not anymore. You have an analytics person, a Chief Data Officer, you might have an IT person, you might have a cloud person. So you're seeing a completely broader set of potential buyers that are driving the change. We heard Paxata talk about that. And this is a dynamic. >> Yeah, definitely. We see a fair amount of, what I'm sensing about Strata, how it's evolving these big top shows around data, it's evolving around addressing a broader, what we call maker culture. It's more than software developers. It's business analysts, it's the people who build the hardware for the internet of things into which AI and machine learning models are being containerized and embedded. I've, you know, one of the takeaways from today so far, and the keynotes are tomorrow at Strata, but I've been walking the atrium at the Javits Center having some interesting conversations, in addition, of course, to the ones we've been having here at theCUBE. And what I'm notic-- >> John: What are those hallway conversations that you're having? >> Yeah. >> What's going on over there? >> Yeah, what I've, the conversations I've had today have been focused on, the chief trend that I'm starting to sense here is that the productionization of the machine learning development process or pipeline, is super hot. It spans multiple data platforms, of course. You've got a bit of Hadoop in the refinery layer, you've got a bit of in-memory columnar databases, like Actian discussed at their own event, but the more important, not more important, but just as important is that what users are looking at is how can we build these DevOps pipelines for continuous management of releases of machine learning models for productionization, but also for ongoing evaluation and scoring and iteration and redeployment into business applications. You know there's, I had conversations with MapR, I had conversations with IBM, I mean, these were atrium conversations about things that they are doing. IBM had an announcement today on the wires and so forth with some relevance to that. And so I'm seeing a fair, I'm hearing, I'm sensing a fair amount of It's The Apps, it's more than just Hadoop. But it's very much the flow of these, these are the core pieces, like AI, core pieces of intellectual property in the most disruptive applications that are being developed these days in all manner, in business and industry in the consumer space. >> So I did not go over to the show floor yet, I've not been over to the Atrium. But, I'll bet you dollars to donuts this is indicative of something that always happens in a complex technology environment. And again, this is something we've thought about particularly talked about here on theCUBE, in fact we talked to Paxata about it a little bit as well. And that is, as an organization gains experience, it starts to specialize. But there's always moments, there are always inflection points in the process of gaining that experience. And by that, or one of the indications of that is that you end up with some people starting to specialize, but not quite sure what they're specializing in yet. And I think that's one of the things that's happening right now is that the skills gap is significant. At the same time that the skills gap is being significant, we're seeing people start to declare their specializations that they don't have skills, necessarily, to perform yet. And the tools aren't catching up. So there's still this tension model, open source, not necessarily focusing on the core problem.
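The "ongoing evaluation and scoring" half of the pipeline Jim describes above is largely monitoring: checking whether the data a deployed model now sees still resembles what it was trained on. One common, simple signal is the population stability index over a score distribution; the sketch below uses synthetic numbers and a rule-of-thumb threshold, and is not tied to any vendor's monitoring product.

```python
# Toy drift check: population stability index (PSI) between the score
# distribution a model saw at training time and the one it sees in
# production. The 0.2 threshold is a common rule of thumb, not a standard;
# all numbers here are synthetic.
import random
from math import log

def psi(expected, actual, bins=10):
    lo, hi = min(expected), max(expected)
    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * bins)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
train_scores = [random.gauss(0.40, 0.10) for _ in range(5000)]
live_scores = [random.gauss(0.55, 0.12) for _ in range(5000)]  # shifted upward
print(f"PSI = {psi(train_scores, live_scores):.2f}")  # well above 0.2: investigate
```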
Skills looking for tools, and explosion in the number of tools out there, not focused on how you simplify, streamline, and put into operation. How all these things work together. It's going to be an interesting couple of years, but the good news, ultimately, is that we are starting to see for the first time, even on theCUBE interviews today, the emergence of a common language about how we think about the characteristics of the problem. And I think that that heralds a new round of experience and a new round of thinking about what is all the business analysts, the data scientists, the developer, the infrastructure person, business person. >> You know, you bring up that comment, those comments, about the specialists and the skills. We talked, Jim and I talked on the segment this morning about tool shed. We're talking about there are so many tools out there, and everyone loves a good tool, a hammer. But the old expression is if you're a hammer, everything looks like a nail, that's cliche. But what's happened is there are a plethora of tools, right, and tools are good. Platforms are better. As people start to replatformize everything they could have too many tools. So we asked the Chief Data Officer, he goes yeah, I try to manage the tool tsunami, but his biggest issue was he buys a hammer, and it turns into a lawnmower. That's a vendor mentality of-- >> What a truck. Well, but that's a classic example of what I'm talking about. >> Or someone's trying to use a hammer to mow the lawn right? Again, so this is what you're getting at. >> Yeah! >> The companies out there are groping for relevance, and that's how you can see the pretenders from the winners. >> Well, a tool, fundamentally, is pedagogical. A tool describes the way work is going to be performed, and that's been a lot of what's been happening over the course of the past few years. Now, businesses that get more experience, they're describing their own way of thinking through a problem. And they're still not clear on how to bring the tools together because the tools are being generated, put into the marketplace by an expanding array of folks and companies, and they're now starting to shuffle for position. But I think ultimately, what we're going to see happen over the next year and I think this is an inflection point, going back to this big tent notion, is the idea that ultimately we are going to see greater specialization over the next few years. My guess is that this year will probably, should get better, or should get bigger, I'm not certain it will because it's focused on the problems that we already solved and not moving into the problems that we need to focus on. >> Yeah, I mean, a lot of the problems I have with the O'Reilly show is that they try to throw thought leadership out there, and there's some smart people that go to that, but the problem is that it's too much about monetization, they try to make too much money from the event when this action's happening. And this is where the tool becomes, the hammer becomes a lawnmower, because what's happening is that the vendor's trying to stay alive. And you mentioned this earlier, to your point, the customers that are buyers of the technology don't want to have something that's not going to be a fit, that's going to be agile from us. They don't want the hammer that they bought to turn into something that they didn't buy it for. And sometimes, teams can't make that leap, skillset-wise, to literally pivot overnight. Especially as a startup.

So this is where the selection of the companies makes a big difference. And a lot of the clients, a lot of customers that we're serving on the end user side are reaching the conclusion that the tools themselves, while important, are clearly not where the value is. The value is in how they put them together for their business. And that's something that's going to have to, again, that's a maturation process, roles, responsibilities, the chief data officer, they're going to have a role in that or not, but ultimately, they're going to have to start finding their pipelines, their process for ingestion out to analysis. >> Let me get your reaction, you guys, your reactions to this tape. Because one of the things that I heard today, and I think this validates a bigger trend as we talk about the landscape of the market, from the event to how people are behaving and promoting and building products and companies. The pattern that I'm hearing, we said it multiple times on theCUBE today and one from the guy who's basically reading the script, is, in his interview, explaining 'cause it's so factual, I asked him the straight-up question, how do you deal with suppliers? What's happening is the trend is don't show me sizzle. I want to see the steak. Don't sell me hype, I got too many business things to work on right now, I need to nail down some core things. I got application development, I got security to build out big time, and then I got all those data channels that I need, I don't have time for you to sell me a hammer that might not be a hammer in the future! So I need real results, I need real performance that's going to have a business impact. That is the theme, and that trumps the hype. I see that becoming a huge thing right now. Your thoughts, reactions, guys-- >> Well I'll start-- >> What's your reaction then? True or false on the trend? Be-- >> Peter: True! >> Get down to business. >> I'll say that much, true, but go ahead. >> I'll say true as well, but let me just add some context. I think a show like O'Reilly Strata is good up to a point, especially to catalyze an industry, a growing industry like big data's own understanding of it, of the value that all these piece parts, Hadoop and Spark and so forth, can add, can provide when deployed in a unit according to some emerging patterns, whatever. But at a certain point where a space like this becomes well-established, it just becomes a pure marketing event. And customers, at a certain point say, you know, I come here for ideas about things that I can do in my environ, my business, that could actually many ways help me to do new things. You know, you can't get that at a marketing-oriented, you can get that, as a user, more at a research-oriented show. When it's an emerging market, like let's say Spark has been, like the Spark Summit was in the beginning, those are kind of like, when industries go through the phase those are sort of in the beginning, sort of research-focused shows where industry, the people who are doing the development of this new architecture, they talk ideas. Now I think in 2017, where we're at now, is what the idea is everybody's trying to get their heads around, they're all around AI, what the heck that is. For a show like an O'Reilly show to have relevance in a market that's in this much ferment of really innovation around AI and deep learning, there needs to be a core research focus that you don't get at this point in the lifecycle of Strata, for example. So that's my take on what's going on. >> So, my take is this.
And first of all, I agree with everything you said, so it's not in opposition to anything. Many years ago I had this thought that I think still is very true. And that is the value of industry, the value of infrastructure is inversely correlated with the degree to which anybody knows anything about it. So if I know a lot about my infrastructure, it's not creating a lot of business value. In fact, more often than not, it's not working, which is why people end up knowing more about it. But the problem is, the way that technology has always been sold is as a differentiated, some sort of value-add thing. So you end up with this tension. And this is an application domain, a very, very complex application domain like big data. The tension is, my tool is so great that, and it's differentiating all those other stuff, yeah but it becomes valuable to me if and only if nobody knows it exists. So I think, and one of the reasons why I bring this up, John, is many of the companies that are in the big data space today that are most successful are companies that are positioning themselves as a service. There's a lot of interesting SaaS applications for big data analysis, pipeline management, all the other things you can talk about, that are actually being rendered as a service, and not as a product. So that all you need to know is what the tool does. You don't need to know the tool. And I don't know that that's necessarily going to last, but I think it's very, very interesting that a lot of the more successful companies that we're talking to are themselves mere infrastructure SaaS companies. >> Because-- >> AtScale is interesting, though. They came in as a service. But their service has an interesting value proposition. They can allow you to essentially virtualize the data to play with it, so people can actually sandbox data. And if it gets traction, they can then double-down on it. So to me that's a freebie. To me, I'm a customer, I got to love that kind of environment because you're essentially giving almost a developer-like environment-- >> Peter: Value without necessarily-- >> Yeah, the cost, and the guy gets the signal from the marketplace, his customer, of what data resolves. To me that's a very cool scene. I don't, you saying that's bad, or? >> No, no, I think it's interesting. I think it's-- >> So you're saying service is-- >> So what I'm saying is, what I'm saying is, that the value of infrastructure is inversely proportional to the degree to which anybody knows anything about it. But you've got a bunch of companies who are selling, effectively, infrastructure software, so it's a value-add thing, and that creates a problem. And a lot of other companies not only have the ability to sell something as a service as opposed to a product, they can put the service froward, and people are using the service and getting what they need out of it without knowing anything about the tool. >> I like that. Let me just maybe possibly restate what you just said. When a market goes toward a SaaS go-to-market delivery model for solutions, the user, the buyer's focus is shifted away from what the solution can do, I mean, how it works under the cover. >> Peter: Quote, value-add-- >> To what it can do potentially for you. >> The business, that's right. >> But you're not going to, don't get distracted by the implementation details. You have then as a user become laser-focused on, wow, there's a bunch of things that this can do for me. I don't care how it works, really. You SaaS provider, you worry about that stuff. 
I can worry now about somehow extracting the value. I'm not distracted. >> This show, or this domain, is one of the domains where SaaS has moved, just as we're thinking about moving up the stack, the SaaS business model is moving down the stack in the big data world. >> All right, so, in summary, the stack is changing. Predictions for the next few days. What are we going to see come out of Strata Data, and our BigData NYC? 'Cause remember, this show was always a big hit, but it's very clear from the data on our dashboards, we're seeing all the social data. Microsoft Ignite is going on, and Microsoft Azure, just in the past few years, has burst on the scene. Cloud is sucking the oxygen out of the big data event. Or is it? >> I doubt it was sucking it out of the event, but you know, theCUBE is in, theCUBE is not at Ignite. Where's theCUBE right now? >> John: BigData NYC. >> No, it's here, but it's also at the Splunk show. >> John: That's true. >> And isn't it interesting-- >> John: We're sucking the data out of two events. >> Did a lot of people coming in, exactly. A lot of people coming-- >> We're live streaming in a streaming data kind of-- >> John just said we suck, there's that record saying that. >> We're sucking all the data. >> So we are-- >> We're sharing data. These videos are data-driven. >> Yeah, absolutely, but the point is, ultimately, is that, is that Splunk is an example of a company that's putting forward a service about how you do this and not necessarily a product focus. And a lot of the folks that are coming on theCUBE here are also going on to theCUBE down in Washington D.C., which is where the Splunk show's at. And so I think one of the things, one of the predictions I'll make, is that we're going to hear over the next couple of days more companies talk about their SaaS strategies. >> Yeah, I mean I just think, I agree with you, but I also agree with the comments about the technology coming together. And here's one thing I want to throw on the table. I've gotten the sense a few times about connecting the dots on it, we'll put it out publicly for comment right now. The role that communities will play outside of developer, is going to be astronomical. I think we're seeing signals, certainly open-source communities have been around for a long time. They continue to grow on the shoulders of giants before them. Even these events like O'Reilly, where the small community they rely on is now not the only game in town. We're seeing the notion of a community strategy in things like Blockchain, you're seeing it in business, you're seeing people rolling out their recruitment to say, data scientists. You're seeing a community model developing in business, yes or no? >> Yes, but I would say, I would put it this way, John. That it's always been there. The difference is that we're now getting enough experience with things that have occurred, for example, collaboration, communal, communal collaboration in open-source software that people are now saying, and they've developed a bunch of social networking techniques where they can actually analyze how those communities work together, but now they're saying, hmm, I've figured out how to do an assessment analysis understanding that community. I'm going to see if I can take that same concept and apply it over here to how sales works, or how B-to-B engagement works, or how marketing gets conducted, or how sales and marketing work together. And they're discovering that the same way of thinking is actually very fruitful over there.
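The "social networking techniques" Peter mentions for analyzing how a community collaborates are, mechanically, graph measures: who is connected to whom, who bridges groups, and where the sub-communities sit. A small networkx sketch of that analysis follows; the people and edges are invented, and the same few calls would run unchanged on a sales or B-to-B engagement graph.

```python
# Toy community analysis: build a collaboration graph, find the most
# central people, and detect sub-communities. Names and edges are invented.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("ana", "bo"), ("ana", "cy"), ("bo", "cy"),      # one cluster
    ("dev", "eli"), ("dev", "fay"), ("eli", "fay"),  # another cluster
    ("cy", "dev"),                                   # the bridge between them
])

centrality = nx.degree_centrality(G)
print(sorted(centrality.items(), key=lambda kv: -kv[1])[:3])  # most connected

for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")
```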
So I totally agree, 100%. >> So they don't rely on other people's version of a community, they can essentially construct their own. >> They are, they are-- >> John: Or enabling their own. >> That's right, they are bringing that approach to thinking about a community-driven business and they're applying it to a lot of new ways, and that's very exciting. >> As the world gets connected with mobile and internet of things as we're seeing, it's one big online community. We're seeing things, I'm writing a post right now, what you could, what B-to-B markets should learn from the fake news problem. And that is content and infrastructure are now contextually tied together. >> Peter: Totally. >> And related. The payload of the fake news is also related to the gamification of the network effect, hence the targeting, hence the weaponization. >> Hey, we wrote the three Cs, we wrote a piece on the three Cs of strategy a year and a half ago. Content, community, context. And at the end of the day, the most important thing to what you're saying about, is that there is, you know, right now people talk about social networking. Social media, you think Facebook. Facebook is a community with a single context, stay in touch with your friends. >> Connections. >> Connections. But what you're really saying is that for the first time we're now going to see an enormous amount of technology being applied to the fullness of all the communities. We're going to see a lot more communities being created with the software, each driven by what content does, creates value, against the context of how it works, where the community's defined in terms of what do we do? >> Let me focus on the fact that bringing, using community as a framework for understanding how the software world is evolving. The software world is evolving towards, I've said this many times in my work about a resurge, the data scientists or data people, data science skills are the core developers in this new era. Now, what is data science all about at its heart? Machine learning, building, and training machine learning models. And so training machine learning models is everything towards making sure that they are fit for their predicted purpose of classification. Training data, where you get all the training data from to feed all, to train all these models? Where do you get all the human resources to label, to do the labeling of the data sets, and so forth, that you need communities, crowdsourcing and whatnot, and you need sustainable communities that can supply the data and the labeling services, and so forth, to be able to sustain the AI and machine learning revolution. So content, creating data and so forth, really rules in this new era, like-- >> The interest in machine learning is at an all-time high, I guess. >> Jim: Yeah, oh yeah, very much so. >> Got it, I agree. I think the social grab, interest grab, value grab is emerging. I think communities, content, context, communities are relevant. I think a lot of things are going to change, and that the scuttlebutt that I'm hearing in this area now is it's not about the big event anymore. It's about the digital component. I think you're seeing people recognize that, but they still want to do the face-to-face. >> You know what, that's right. That's right, they still want, let's put it this way. That there are, that the whole point of community is we do things together. And there are some things that are still easier to do together if we get together. 
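Jim's point about communities supplying labels has a simple mechanical core: several annotators label the same item, the votes are aggregated, and low-agreement items get routed back for review. Below is a toy majority-vote aggregation; the items, labels, and the 0.67 agreement cutoff are invented, and real crowdsourcing pipelines additionally weight annotators by estimated reliability.

```python
# Toy crowd-label aggregation: majority vote per item plus a naive
# agreement score. Annotators, items, labels, and cutoff are invented.
from collections import Counter

votes = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}

for item, labels in votes.items():
    (label, count), = Counter(labels).most_common(1)
    agreement = count / len(labels)
    flag = "" if agreement >= 0.67 else "  <- low agreement, send for review"
    print(f"{item}: {label} (agreement {agreement:.2f}){flag}")
```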
But B-to-B marketing, you just can't say, we're not going to do events when there's a whole machinery behind events. Lead-gen batch marketing, we call it. There's a lot of stuff that goes on in that funnel. You can't just say hey, we're going to do a blog post. >> People still need to connect. >> So it's good, but there's some online tools that are happening, so of course. You wanted to say something? >> Yeah, I just want to say one thing. Face to face validates the source of expertise. I don't really fully trust an expert, I can't in my heart engage with them, 'til I actually meet them and figure out in person whether they really do have the goods, or whether they're repurposing some thinking that they got from elsewhere and they gussy it up. So face, there's no substitute for face-to-face to validate the expertise. The expertise that you value enough to want to engage in your solution, or whatever it might be. >> Awesome, I agree. Online activities, the content, we're streaming the data, theCUBE, this is our annual event in New York City. We've got three days of coverage, Tuesday, Wednesday, Thursday, here, theCUBE in Manhattan, right around the corner from Strata Hadoop at the Javits Center, full of influencers. We're here with the VIPs, with the entrepreneurs, with the CEOs and all the top analysts from WikiBon and around the community. Be there tomorrow all day, day one wrap up is done. Thanks for watching, see you tomorrow. (rippling music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jim Kobielus | PERSON | 0.99+ |
Peter Burris | PERSON | 0.99+ |
O'Reilly | ORGANIZATION | 0.99+ |
Jim | PERSON | 0.99+ |
John | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
O'Reilly Media | ORGANIZATION | 0.99+ |
Manhattan | LOCATION | 0.99+ |
2017 | DATE | 0.99+ |
John Furrier | PERSON | 0.99+ |
New York City | LOCATION | 0.99+ |
Peter | PERSON | 0.99+ |
Washington D.C. | LOCATION | 0.99+ |
New York | LOCATION | 0.99+ |
tomorrow | DATE | 0.99+ |
five years | QUANTITY | 0.99+ |
two events | QUANTITY | 0.99+ |
100% | QUANTITY | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
first time | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
Wednesday | DATE | 0.99+ |
a year and a half ago | DATE | 0.99+ |
Thursday | DATE | 0.99+ |
one | QUANTITY | 0.99+ |
Spark Summit | EVENT | 0.99+ |
three days | QUANTITY | 0.99+ |
Tuesday | DATE | 0.98+ |
Javits Center | LOCATION | 0.98+ |
Splunk | ORGANIZATION | 0.98+ |
Paxata | ORGANIZATION | 0.98+ |
ORGANIZATION | 0.98+ | |
next year | DATE | 0.97+ |
this year | DATE | 0.97+ |
SaaS | TITLE | 0.97+ |
day one | QUANTITY | 0.96+ |
NYC | LOCATION | 0.96+ |
first | QUANTITY | 0.96+ |
one thing | QUANTITY | 0.96+ |
WikiBon | ORGANIZATION | 0.95+ |
one show | QUANTITY | 0.94+ |
one shows | QUANTITY | 0.94+ |
BigData | ORGANIZATION | 0.94+ |
Many years ago | DATE | 0.93+ |
Strata | LOCATION | 0.93+ |
Strata Hadoop | LOCATION | 0.92+ |
each | QUANTITY | 0.91+ |
three Cs | QUANTITY | 0.9+ |
Javits Center | ORGANIZATION | 0.89+ |
midtown Manhattan | LOCATION | 0.88+ |
theCUBE | ORGANIZATION | 0.87+ |
Strata | TITLE | 0.87+ |
past few years | DATE | 0.87+ |
Dave Tang, Western Digital – When IoT Met AI: The Intelligence of Things - #theCUBE
>> Presenter: From the Fairmont Hotel, in the heart of Silicon Valley, it's theCUBE. Covering When IoT Met AI The Intelligence of Things. Brought to you by Western Digital. >> Hey welcome back everybody, Jeff Frick here with theCUBE. We're in downtown San Jose at the Fairmont Hotel, at an event called When IoT Met AI The Intelligence of Things. You've heard about the internet of things, and on the intelligence of things, it's IoT, it's AI, it's AR, all this stuff is really coming to play, it's a very interesting space, still a lot of start-up activity, still a lot of big companies making plays in this space. So we're excited to be here, and really joined by our host, big thanks to Western Digital for hosting this event with WDLabs' Dave Tang. Got newly promoted since last we spoke. The SVP of corporate marketing and communications for Western Digital, Dave great to see you as usual. >> Well, great to be here, thanks. >> So I don't think the need for more storage is going down anytime soon, that's kind of my takeaway. >> No, no, yeah. This wall of data just keeps growing. >> Yeah, I think the term we had yesterday at the Ag event that we were at, also sponsored by you, is really the flood of data, using an agricultural term. But it's pretty fascinating, as more, and more, and more data is not only coming off the sensors, but coming off the people, and used in so many more ways. >> That's right, yeah we see it as a virtuous cycle, you create more data, you find more uses for that data to harness the power and unleash the promise of that data, and then you create even more data. So, when that virtuous cycle of creating more, and finding more uses of it, and yeah one of the things that we find interesting, that's related to this event with IoT and AI, is this notion that data is falling into two general categories. There's big data, and there's fast data. So, big data I think everyone is quite familiar with by this time, these large aggregated lakes of data that you can extract information out of. Look for insights and connections between data, predict the future, and create more prescriptive recommendations, right? >> Right. >> And through all of that you can gain algorithms that help to make predictions, or can help machines run based on that data. So we've gone through this phase where we focused a lot on how we harness big data, but now we're taking these algorithms that we've gleaned from that, and we're able to put them in real time applications, and that's sort of been the birth of fast data, it's been really-- >> Right, the streaming data. We cover Spark Summit, we cover Flink, a new kind of open source project that came out of Berlin, that some people would say is the next generation of Spark, and the other thing, you know, good for you guys, is that it used to be, not only was it old data, but it was a sampling of old data. Now, with this new data and the data stream, that's all of the data. And I would actually challenge, I wonder if that separation as you describe, will stay, because I got to tell you, the last little drive I bought, just last week, was an SSD drive, you know, one terabyte. I needed some storage, and I had a choice between spinning disc and not, and I went with the flash. I mean, 'cause what's fascinating to me, is the second order benefits that we keep hearing time, and time, and time again, once people become a data-driven enterprise, are way more than just that kind of top-level thing that they thought.
>> Exactly, and that's sort of that virtuous cycle, you get a taste, and you learn how to use it, and then you want more. >> Jeff: Right, right. >> And that's the great thing about the breadth of technologies and products that Western Digital has, is from the solid state products, the higher performance flash products that we have, to the higher capacity helium-filled drive technologies, as well as devices going on up into systems, we cover this whole spectrum of fast data and big data. >> Right, right. >> I'll give an example. So credit card fraud detection is an interesting area. Billions of dollars potentially being lost there. Well, to learn how to predict when transactions are fraudulent, you have to study massive amounts of data. Billions of transactions, so that's the big data side of it, and then as soon as you do that, you can take those algorithms and run them in real time. So as transactions come in for authorization, those algorithms can determine, before they're approved, that one's fraudulent, and that one's not. Save a lot of time and processing for fraud claims. So that's a great example of once you learn something from big data, you apply it to the real-time realm, and it's quite dire right? And then that spawned you to collect even more data, because you want to find new applications and new uses. >> Right, and too kind of this wave of computing back and forth from the shared services computer, then the desktop computer, now it's back to the cloud, and then now it's-- >> Dave: Out with the edge. >> IoT, it's all about the edge. >> Yeah, right. >> And at the end of the day, it's going to be application-specific. What needs to be processed locally, what needs to be processed back at the computer, and then all the different platforms. We were again at a navigation for autonomous vehicles show, who knew there was such a thing that small? And even the attributes of the storage required in the ecosystem of a car, right? And the environmental conditions-- >> That's right. >> Is the word I'm looking for. Completely different, new opportunity, kind of new class of hardware required to operate in that environment, and again that still combines cloud and Edge, sensors and maps. So just that, I don't think that the demand's going down, David. >> Yeah, absolutely. >> I think you're in a good spot. (Jeff laughing) >> You're absolutely right, and even though we try to simplify into fast data, and big data, and Core and Edge, what we're finding is that applications are increasingly specialized, and have specialized needs in terms of the type of data. Is it large amounts of data, is it streaming? You know, what are the performance characteristics, and how is it being transformed, what's the compute aspect of it? And what we're finding, is that the days of general-purpose compute and storage, and memory platforms, are fading, and we're getting into environments with increasingly specialized architectures, across all those elements. Compute, memory and storage. So that's what's really exciting to be in our spot in the industry, is that we're looking at creating the future by developing new technologies that continue to fuel that growth even further, and fuel the uses of data even further. >> And fascinating just the ongoing case of Moore's law, which I know is not, you know you're not making microprocessors, but I think it's so powerful. Moore's law really is a philosophy, as opposed to an architectural spec. Just this relentless pace of innovation, and you guys just continue to push the envelope.
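Dave's fraud-detection example maps cleanly onto the big data / fast data split he describes: fit a model offline on a large history of labelled transactions, then apply the fitted model to transactions as they stream in for authorization. A minimal PySpark sketch of that pattern follows; the storage paths, Kafka broker, topic, and column names (txn_id, amount, merchant_risk, and so on) are illustrative assumptions, not any actual card-network pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-sketch").getOrCreate()

# Big data side: learn from a large history of labelled transactions.
# Hypothetical columns: txn_id, amount, merchant_risk, hour_of_day,
# distance_from_home, is_fraud.
history = spark.read.parquet("s3://example-bucket/transactions/history/")

features = VectorAssembler(
    inputCols=["amount", "merchant_risk", "hour_of_day", "distance_from_home"],
    outputCol="features")
model = Pipeline(stages=[features,
                         LogisticRegression(labelCol="is_fraud")]).fit(history)

# Fast data side: score transactions as they arrive, before authorization completes.
# Requires the spark-sql-kafka connector on the classpath.
incoming = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
            .option("subscribe", "card-authorizations")         # hypothetical topic
            .load()
            .select(F.from_json(F.col("value").cast("string"),
                                history.drop("is_fraud").schema).alias("txn"))
            .select("txn.*"))

scored = model.transform(incoming).select("txn_id", "probability", "prediction")

query = (scored.writeStream
         .outputMode("append")
         .format("console")   # in practice this would feed the authorization service
         .start())
```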
So what are your kind of priorities? I can't believe we're halfway through 2017 already, but for kind of the balance of the year kind of, what are some of your top-of-mind things? I know it's exciting times, you're going through the merger, you know, the company is in a great space. What are your kind of top priorities for the next several months? >> Well, so, I think as a company that has gone through serial acquisitions and integrations, of course we're continuing to drive the transformation of the overall business. >> But the fun stuff right? It's not to increase your staff (Jeff laughing). >> Right, yeah, that is the hardware. >> Stitching together the European systems. >> But yeah, the fun stuff includes pushing the limits even further with solid state technologies, with our 3D NAND technologies. You know, we're leading the industry in 64 layer 3D NAND, and just yesterday we announced a 96 layer 3D NAND. So pushing those limits even further, so that we can provide higher capacities in smaller footprints, lower power, in mobile devices and out on the Edge, to drive all these exciting opportunities in IoT an AI. >> It's crazy, it's crazy. >> Yeah it is, yeah. >> You know, terabyte SD cards, terabyte Micro SD cards, I mean the amount of power that you guys pack into these smaller and smaller packages, it's magical. I mean it's absolutely magic. >> Yeah, and the same goes on the other end of the spectrum, with high-capacity devices. Our helium-filled drives are getting higher and higher capacity, 10, 12, 14 terabyte high-capacity devices for that big data core, that all the data has to end up with at some point. So we're trying to keep a balance of pushing the limits on both ends. >> Alright, well Dave, thanks for taking a few minutes out of your busy day, and congratulations on all your success. >> Great, good to be here. >> Alright, he's Dave Tang from Western Digital, he's changing your world, my world, and everyone else's. We're here in San Jose, you're watching theCUBE, thanks for watching.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jeff Frick | PERSON | 0.99+ |
Dave Tang | PERSON | 0.99+ |
Jeff | PERSON | 0.99+ |
San Jose | LOCATION | 0.99+ |
Western Digital | ORGANIZATION | 0.99+ |
Dave | PERSON | 0.99+ |
12 | QUANTITY | 0.99+ |
10 | QUANTITY | 0.99+ |
Berlin | LOCATION | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
David | PERSON | 0.99+ |
yesterday | DATE | 0.99+ |
last week | DATE | 0.99+ |
2017 | DATE | 0.99+ |
second order | QUANTITY | 0.99+ |
both ends | QUANTITY | 0.98+ |
Billions of dollars | QUANTITY | 0.98+ |
one terabyte | QUANTITY | 0.97+ |
Flink | ORGANIZATION | 0.96+ |
The Intelligence of Things | TITLE | 0.95+ |
14 terabyte | QUANTITY | 0.95+ |
Ag | EVENT | 0.94+ |
one | QUANTITY | 0.94+ |
two general categories | QUANTITY | 0.91+ |
European | OTHER | 0.9+ |
theCUBE | ORGANIZATION | 0.87+ |
#theCUBE | ORGANIZATION | 0.87+ |
Billions of transactions | QUANTITY | 0.87+ |
Fairmont Hotel | LOCATION | 0.87+ |
WDLabs' | ORGANIZATION | 0.81+ |
64 | QUANTITY | 0.81+ |
Spark Summit | EVENT | 0.71+ |
96 layer | QUANTITY | 0.67+ |
Moore | PERSON | 0.66+ |
yone | PERSON | 0.66+ |
next several months | DATE | 0.64+ |
Core | ORGANIZATION | 0.59+ |
Edge | TITLE | 0.58+ |
terabyte | ORGANIZATION | 0.55+ |
layer 3D | OTHER | 0.55+ |
Spark | TITLE | 0.46+ |
theCUBE | TITLE | 0.42+ |
When IoT | TITLE | 0.36+ |
3D | QUANTITY | 0.26+ |
George Chow, Simba Technologies - DataWorks Summit 2017
>> (Announcer) Live from San Jose, in the heart of Silicon Valley, it's theCUBE covering DataWorks Summit 2017, brought to you by Hortonworks. >> Hi everybody, this is George Gilbert, Big Data and Analytics Analyst with Wikibon. We are wrapping up our show on theCUBE today at DataWorks 2017 in San Jose. It has been a very interesting day, and we have a special guest to help us do a survey of the wrap-up, George Chow from Simba. We used to call him Chief Technology Officer, now he's Technology Fellow, but when we was explaining the different in titles to me, I thought he said Technology Felon. (George Chow laughs) But he's since corrected me. >> Yes, very much so >> So George and I have been, we've been looking at both Spark Summit last week and DataWorks this week. What are some of the big advances that really caught your attention? >> What's caught my attention actually is how much manufacturing has really, I think, caught into the streaming data. I think last week was very notable that both Volkswagon and Audi actually had case studies for how they're using streaming data. And I think just before the break now, there was also a similar session from Ford, showcasing what they are doing around streaming data. >> And are they using the streaming analytics capabilities for autonomous driving, or is it other telemetry that they're analyzing? >> The, what is it, I think the Volkswagon study was production, because I still have to review the notes, but the one for Audi was actually quite interesting because it was for managing paint defect. >> (George Gilbert) For paint-- >> Paint defect. >> (George Gilbert) Oh. >> So what they were doing, they were essentially recording the environmental condition that they were painting the cars in, basically the entire pipeline-- >> To predict when there would be imperfections. >> (George Chow) Yes. >> Because paint is an extremely high-value sort of step in the assembly process. >> Yes, what they are trying to do is to essentially make a connection between downstream defect, like future defect, and somewhat trying to pinpoint the causes upstream. So the idea is that if they record all the environmental conditions early on, they could turn around and hopefully figure it out later on. >> Okay, this sounds really, really concrete. So what are some of the surprising environmental variables that they're tracking, and then what's the technology that they're using to build model and then anticipate if there's a problem? >> I think the surprising finding they said were actually, I think it was a humidity or fan speed, if I recall, at the time when the paint was being applied, because essentially, paint has to be... Paint is very sensitive to the condition that is being applied to the body. So my recollection is that one of the finding was that it was a narrow window during which the paint were, like, ideal, in terms of having the least amount of defect. >> So, had they built a digital twin style model, where it's like a digital replica of some aspects of the car, or was it more of a predictive model that had telemetry coming at it, and when it's an outside a certain bounds they know they're going to have defects downstream? >> I think they're still working on the predictive model, or actually the model is still being built, because they are essentially trying to build that model to figure out how they should be tuning the production pipeline. >> Got it, so this is sort of still in the development phase? 
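The Audi paint-defect work described above is, at its core, supervised learning: log the booth conditions for each body, join them to downstream defect outcomes, and fit a model that flags risky conditions upstream. A rough PySpark sketch of that idea is below, assuming a hypothetical table of per-body readings; apart from humidity and fan speed, which come up in the conversation, the feature names and paths are invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("paint-defect-sketch").getOrCreate()

# Hypothetical table: one row per painted body, environmental readings plus a
# label marking whether a defect was found downstream.
runs = spark.read.parquet("s3://example-bucket/paint-line/runs/")

assembler = VectorAssembler(
    inputCols=["humidity", "fan_speed", "booth_temperature", "paint_viscosity"],
    outputCol="features")
classifier = GBTClassifier(labelCol="defect_found", featuresCol="features")

train, test = runs.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, classifier]).fit(train)

auc = BinaryClassificationEvaluator(labelCol="defect_found").evaluate(model.transform(test))
print(f"Held-out AUC: {auc:.3f}")

# Feature importances hint at which conditions (for example a narrow humidity
# window) are most associated with downstream defects.
print(model.stages[-1].featureImportances)
```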
>> (George Chow) Yeah, yeah >> And can you tell us, did they talk about the technologies that they're using? >> I remember the... It's a little hazy now because after a couple weeks of conference, so I don't remember the specifics because I was counting on the recordings to come out in a couples weeks' time. So I'll definitely share that. It's a case study to keep an eye on. >> So tell us, were there other ones where this use of real-time or near real-time data had some applications that we couldn't do before because we now can do things with very low latency? >> I think that's the one that I was looking forward to with Ford. That was the session just earlier, I think about an hour ago. The session actually consisted of a demo that was being done live, you know. It was being streamed to us where they were showcasing the data that was coming off a car that's been rigged up. >> So what data were they tracking and what were they trying to anticipate here? >> They didn't give enough detail, but it was basically data coming off of the CAN bus of the car, so if anybody is familiar with the-- >> Oh that's right, you're a car guru, and you and I compare, well our latest favorite is the Porche Macan >> Yes, yes. >> SUV, okay. >> But yeah, they were looking at streaming the performance data of the car as well as the location data. >> Okay, and... Oh, this sounds more like a test case, like can we get telemetry data that might be good for insurance or for... >> Well they've built out the system enough using the Lambda Architecture with Kafka, so they were actually consuming the data in real-time, and the demo was actually exactly seeing the data being ingested and being acted on. So in the case they were doing a simplistic visualization of just placing the car on the Google Map so you can basically follow the car around. >> Okay so, what was the technical components in the car, and then, how much data were they sending to some, or where was the data being sent to, or how much of the data? >> The data was actually sent, streamed, all the way into Ford's own data centers. So they were using NiFi with all the right proxy-- >> (George Gilbert) NiFi being from Hortonworks there. >> Yeah, yeah >> The Hortonworks data flow, okay >> Yeah, with all the appropriate proxys and firewall to bring it all the way into a secure environment. >> Wow >> So it was quite impressive from the point of view of, it was life data coming off of the 4G modem, well actually being uploaded through the 4G modem in the car. >> Wow, okay, did they say how much compute and storage they needed in the device, in this case the car? >> I think they were using a very lightweight platform. They were streaming apparently from the Raspberry Pi. >> (George Gilbert) Oh, interesting. >> But they were very guarded about what was inside the data center because, you know, for competitive reasons, they couldn't share much about how big or how large a scale they could operate at. >> Okay, so Simba has been doing ODBC and JDBC drivers to standard APIs, to databases for a long time. That was all about, that was an era where either it was interactive or batch. So, how is streaming, sort of big picture, going to change the way applications are built? 
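In the Ford demo just described, the interesting part is the plumbing: telemetry leaves the car over the 4G modem, NiFi carries it through the proxies into the data center, and a Lambda-style pipeline consumes it off Kafka in real time. A minimal Structured Streaming consumer for that last hop might look like the sketch below; the broker address, topic name, and message fields (vehicle id, speed, latitude, longitude) are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("vehicle-telemetry-sketch").getOrCreate()

# Assumed shape of each telemetry message after NiFi lands it in Kafka.
telemetry_schema = StructType([
    StructField("vehicle_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("speed_kph", DoubleType()),
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")                                   # needs the spark-sql-kafka connector
       .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
       .option("subscribe", "vehicle-telemetry")          # hypothetical topic
       .load())

telemetry = (raw
             .select(F.from_json(F.col("value").cast("string"), telemetry_schema).alias("t"))
             .select("t.*"))

# Latest known position per vehicle: enough to drive a "dot moving on a map" demo.
latest_position = (telemetry
                   .groupBy("vehicle_id")
                   .agg(F.max(F.struct("event_time", "latitude", "longitude")).alias("last_seen")))

query = (latest_position.writeStream
         .outputMode("update")
         .format("console")
         .start())
```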
>> Well, one way to think about streaming is that if you look at many of these APIs, into these systems, like Spark is a good example, where they're trying to harmonize streaming and batch, or rather, to take away the need to deal with it as a streaming system as opposed to a batch system, because it's obviously much easier to think about and reason about your system when it is traditional, like in the traditional batch model. So, the way that I see it also happening is that streaming systems will, you could say will adapt, will actually become easier to build, and everyone is trying to make it easier to build, so that you don't have to think about and reason about it as a streaming system. >> Okay, so this is really important. But they have to make a trade-off if they do it that way. So there's the desire for leveraging skill sets, which were all batch-oriented, and then, presumably SQL, which is a data manipulation everyone's comfortable with, but then, if you're doing it batch-oriented, you have a portion of time where you're not sure you have the final answer. And I assume if you were in a streaming-first solution, you would explicitly know whether you have all the data or don't, as opposed to late arriving stuff, that might come later. >> Yes, but what I'm referring to is actually the programming model. All I'm saying is that more and more people will want streaming applications, but more and more people need to develop it quickly, without having to build it in a very specialized fashion. So when you look at, let's say the example of Spark, when they focus on structured streaming, the whole idea is to make it possible for you to develop the app without having to write it from scratch. And the comment about SQL is actually exactly on point, because the idea is that you want to work with the data, you can say, not mindful, not with a lot of work to account for the fact that it is actually streaming data that could arrive out of order even, so the whole idea is that if you can build applications in a more consistent way, irrespective whether it's batch or streaming, you're better off. >> So, last week even though we didn't have a major release of Spark, we had like a point release, or a discussion about the 2.2 release, and that's of course very relevant for our big data ecosystem since Spark has become the compute engine for it. Explain the significance where the reaction time, the latency for Spark, went down from several hundred milliseconds to one millisecond or below. What are the implications for the programming model and for the applications you can build with it. >> Actually, hitting that new threshold, the millisecond, is actually a very important milestone because when you look at a typical scenario, let's say with AdTech where you're serving ads, you really only have, maybe, on the order about 100 or maybe 200 millisecond max to actually turn around. >> And that max includes a bunch of things, not just the calculation. >> Yeah, and that, let's say 100 milliseconds, includes transfer time, which means that in your real budget, you only have allowances for maybe, under 10 to 20 milliseconds to compute and do any work. So being able to actually have a system that delivers millisecond-level performance actually gives you ability to use Spark right now in that scenario. >> Okay, so in other words, now they can claim, even if it's not per event processing, they can claim that they can react so fast that it's as good as per event processing, is that fair to say? 
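George's point about harmonizing batch and streaming shows up directly in Structured Streaming's API: the same DataFrame transformation can be written once and run over a static table or an unbounded stream, with a watermark absorbing late, out-of-order events. A small, self-contained illustration, with invented paths and columns:

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

def clicks_per_minute(events: DataFrame) -> DataFrame:
    """One transformation, written once, applied to batch or streaming input."""
    if events.isStreaming:
        # Tolerate late, out-of-order events on the streaming path.
        events = events.withWatermark("event_time", "10 minutes")
    return (events
            .groupBy(F.window("event_time", "1 minute"), "page")
            .count())

# Assumed event layout; paths below are placeholders.
schema = "event_time TIMESTAMP, page STRING, user_id STRING"

# Batch: run the logic over yesterday's files.
batch_events = spark.read.schema(schema).json("s3://example-bucket/events/2017-06-06/")
clicks_per_minute(batch_events).write.mode("overwrite").parquet("s3://example-bucket/reports/clicks/")

# Streaming: identical logic over a live file source.
stream_events = spark.readStream.schema(schema).json("s3://example-bucket/events/incoming/")
(clicks_per_minute(stream_events).writeStream
 .outputMode("append")
 .format("parquet")
 .option("path", "s3://example-bucket/reports/clicks_live/")
 .option("checkpointLocation", "s3://example-bucket/checkpoints/clicks/")
 .start())
```

The point is less the specific aggregation than the shape: one function, reasoned about once, reused on both sides.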
>> Yes, yes that's very fair. >> Okay, that's significant. So, what type... How would you see applications changing? We've only got another minute or two, but how do you see applications changing now that, Spark has been designed for people that have traditional, batch-oriented skills, but who can now learn how to do streaming, real-time applications without learning anything really new. How will that change what we see next year? >> Well I think we should be careful to not pigeonhole Spark as something built for batch, because I think the idea is that, you could say, the originators, of Spark know that it's all about the ease of development, and it's the ease of reasoning about your system. It's not the fact that the technology is built for batch, so the fact that you could use your knowledge and experience and an API that actually is familiar, should leverage it for something that you can build for streaming. That's the power, you could say. That's the strength of what the Spark project has taken on. >> Okay, we're going to have to end it on that note. There's so much more to go through. George, you will be back as a favorite guest on the show. There will be many more interviews to come. >> Thank you. >> With that, this is George Gilbert. We are DataWorks 2017 in San Jose. We had a great day today. We learned a lot from Rob Bearden and Rob Thomas up front about the IBM deal. We had Scott Gnau, CTO of Hortonworks on several times, and we've come away with an appreciation for a partnership now between IBM and Hortonworks that can take the two of them into a set of use cases that neither one on its own could really handle before. So today was a significant day. Tune in tomorrow, we have another great set of guests. Keynotes start at nine, and our guests will be on starting at 11. So with that, this is George Gilbert, signing out. Have a good night. (energetic, echoing chord and drum beat)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
IBM | ORGANIZATION | 0.99+ |
George | PERSON | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Scott Gnau | PERSON | 0.99+ |
Rob Bearden | PERSON | 0.99+ |
Audi | ORGANIZATION | 0.99+ |
Rob Thomas | PERSON | 0.99+ |
San Jose | LOCATION | 0.99+ |
George Chow | PERSON | 0.99+ |
Ford | ORGANIZATION | 0.99+ |
last week | DATE | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
one millisecond | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
next year | DATE | 0.99+ |
100 milliseconds | QUANTITY | 0.99+ |
200 millisecond | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
tomorrow | DATE | 0.99+ |
Volkswagon | ORGANIZATION | 0.99+ |
this week | DATE | 0.99+ |
Google Map | TITLE | 0.99+ |
AdTech | ORGANIZATION | 0.99+ |
DataWorks 2017 | EVENT | 0.98+ |
DataWorks Summit 2017 | EVENT | 0.98+ |
both | QUANTITY | 0.98+ |
11 | DATE | 0.98+ |
Spark | TITLE | 0.98+ |
Wikibon | ORGANIZATION | 0.96+ |
under 10 | QUANTITY | 0.96+ |
one | QUANTITY | 0.96+ |
20 milliseconds | QUANTITY | 0.95+ |
Spark Summit | EVENT | 0.94+ |
first solution | QUANTITY | 0.94+ |
SQL | TITLE | 0.93+ |
hundred milliseconds | QUANTITY | 0.93+ |
2.2 | QUANTITY | 0.92+ |
one way | QUANTITY | 0.89+ |
Spark | ORGANIZATION | 0.88+ |
Lambda Architecture | TITLE | 0.87+ |
Kafka | TITLE | 0.86+ |
minute | QUANTITY | 0.86+ |
Porche Macan | ORGANIZATION | 0.86+ |
about 100 | QUANTITY | 0.85+ |
ODBC | TITLE | 0.84+ |
DataWorks | EVENT | 0.84+ |
NiFi | TITLE | 0.84+ |
about an hour ago | DATE | 0.8+ |
JDBC | TITLE | 0.79+ |
Raspberry Pi | COMMERCIAL_ITEM | 0.76+ |
Simba | ORGANIZATION | 0.75+ |
Simba Technologies | ORGANIZATION | 0.74+ |
couples weeks' | QUANTITY | 0.7+ |
CTO | PERSON | 0.68+ |
theCUBE | ORGANIZATION | 0.67+ |
twin | QUANTITY | 0.67+ |
couple weeks | QUANTITY | 0.64+ |
Matt Fryer, Hotels.com - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's The Cube. Covering Spark Summit 2017. Brought to you by Databricks. >> The Cube is live once again from Spark Summit 2017, I'm David Goad, your host, here with George Gilbert, and we are interviewing many of the speakers that we saw on stage this morning at the keynote. Happy to introduce our next guest on the show, his name is Matt Fryer, Matt, how're you doing? >> Matt: Very well. >> You're the chief, Chief Data Science Officer, I don't see many CDSOs out there, is that a common-- >> I think to say, it's a newer title, and it's coming, I think, where companies that feel the use of data, data science and algorithms, are fundamental to their, their futures. They're creating both the mix of commercial, technical, and algorithmic skill sets, this one team, and to execute together, and that's where the title came from. There's more coming, there's a number of-- Facebook have a few, that's one for example, but it's a newer title, I think it's going to become larger and larger, as time goes on. >> David: So, the CDSO for Hotels.com, something else we learned about you that you may not want me to reveal, but I heard you were the inspiration for Captain Obvious, is that true? >> Uh, that's not true. (laughter) I think Captain Obvious is only an expression of my brand, so there's an awesome brand team, at our office in Dallas. (crosstalk) We all love the captain, he has some good humorous moments, and he keeps us all kind of happy. >> Oh, yeah, he states the obvious, we're going to talk about some of the obvious, and maybe some of the not-so obvious here in this interview. So let's talk a little bit about company culture, because you talked a lot on the stage this morning about customer-first kind of approach, rather than a, "Ooh, look what I can do with the technology." Talk a little bit more about the culture at Hotels.com. >> And that's important, and I think, we're a very data-driven culture, I think most tech companies, and travel, technology companies have that kind of ethos. But fundamentally, the focus and the reason we exist is for the customer. So we want to bring, and actually-- in even better ways than that, I think it's the people. So whether it's the focus on the customer, if we did the right thing by the customer, we fundamentally want you to use our platform time and time again. Whatever need you have, booking, lodging and travel, please use our platform. That's the crucial win. So, to do that, we have to always delight you in every experience you have with us. And equally about people, it's about the team, so we have an internal concept called being supportive. So the whole part of our team culture, is that everybody helps everybody else out, we don't single things out, we're all part of the same team, and we all win if all of us pull together. That makes it a great place, a fun place to work, we're going to play with some new technologies, tech is important to us, but actually the people are even more important to us. >> In part why you love the Spark Summit then, huh? Same kind of spirit here, right? >> It's great, I think it's my third Spark Summit, my second time over in San Francisco, and the size of it is very impressive now. I just love meeting other people learning about some of the things they're up to, how we can apply those back to our business, and hopefully sharing a little bit of what we're up to. 
>> David: Let's dive into how you're applying it to your business, you talked about this evolution toward becoming an algorithm business, what does that mean and what part does Spark play in that? >> Matt: I think what it is, is about how do you, if you think about a bit of the journey, historically, a lot of the opportunity came in building new features, constantly building it, it's almost like a semi arms race, about how to build more and more features. The crucial thing I think going forward, and particularly with mobile devices now, we have over half our traffic, comes from people using smartphones, on both the app and mobile web. That bringing together means that, be more targeted, in understanding your journey, and people are, last on to time, speed is much more important, people expect things to be right there when they need it, relevance is much more important to people, so we need to bring all those things together to offer a much more targeted experience, and a much more real-time experience. People expect you to have understood what they did milliseconds ago, and respond to that. The only way you can do that is using data science and algorithms. You balance out on a business operation side, just how do you scale? The analogy I use with, say, anomaly detection, which is a crucial feature for enterprises. Used to have a large business intelligence, lots of reports, pages of paper, now people have things like Tablo, Power BI, those are great and you need those to start with, but really as a business leader, you want to know, "Tell me what's broken, tell me what's changed, "because if it's changed something caused the change, "tell me why it's slowly moving, and most importantly, "tell me where the opportunity is." And that transforms the conversation where algorithms can really surface that to users, and it's about organic intelligence, it's not about artificial intelligence, it's about how would you bring together the people, and the advance in technology to really do a great job for customers. >> David: Well, you mentioned AI, you made a big bold claim about AI, I'm going to ask George to weigh in on this in just a moment, you said AI was going to be the next big thing in the travel industry, can you explain? >> One of the next big things, I think. Yeah, I think it's already happening, in fact, our chairman, Mr. Diller made that statement very recently, also backed up by both the CEO and the brand president, where it's... If you think about 20 years ago, one of the things both Expedia and Hotels.com, and travel online space did, were democratize price information, and made it transparent to users. So previously, the power was with the travel agents, that power moved to the user, they had the information. And that's evolved over time, and what we feel with artificial intelligence, particularly organic intelligence, enablers like mobile, messaging and having conversations, have a machine learning how to make this happen, that you can turn the screen around and actually empower users always with the second revolution. They actually have the advice, and the benefits you had a number of years ago from travel agents: A, they had the price transparency, they have the other part now, which is the content, advice, and what's the most relevant to help them. And you can listen to what they're saying to you, as a customer, and actually we can now replay the perfect information back to them, or increasingly perfect as time goes on. 
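The "tell me what's broken, tell me what's changed" idea usually starts with something simpler than deep learning: compare each metric against its own recent history and surface the points that deviate by more than a few standard deviations. A toy PySpark sketch of that baseline on a hypothetical daily bookings table follows; the table, columns, and threshold are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("anomaly-sketch").getOrCreate()

# Hypothetical daily metric: bookings per market per day.
daily = spark.read.parquet("s3://example-bucket/metrics/daily_bookings/")

# Rolling baseline: mean and standard deviation over the previous 28 days, per market.
trailing = (Window.partitionBy("market")
            .orderBy("day")
            .rowsBetween(-28, -1))

scored = (daily
          .withColumn("baseline_mean", F.avg("bookings").over(trailing))
          .withColumn("baseline_std", F.stddev("bookings").over(trailing))
          .withColumn("z_score",
                      (F.col("bookings") - F.col("baseline_mean")) / F.col("baseline_std")))

# Surface only the days that deviate sharply from their own history.
anomalies = scored.where(F.abs(F.col("z_score")) > 3)
anomalies.orderBy(F.desc(F.abs(F.col("z_score")))).show(20, truncate=False)
```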
(crosstalk) >> That is fascinating, 'cause in the way you broke that out, with--it wasn't actually only travel, but over the last couple decades, price transparency became an issue for many industries, but what you're saying now is, by giving the content to surprise and delight the customer, as long as you're collecting the data breadcrumbs to help you do that, you're not giving up control, you're actually creating stickiness. >> Matt: We're empowering, is the language I use. And if you empower the user, the more likely to come back to use your service in the future, and that's really what we want, we want happy customers. >> George: Tell us a little bit, at the risk of dropping a little in the wait, tell us a little bit about how you empower, in other words, how do you know what type of content to serve up, and how do you measure how they engage with it? >> It's a great question, and I think it's quite embryonic, part of the world right now. I don't think anybody's-- have we made some great developments? I said it was a long journey we have, but it's a lot about how do you, and this is true across data science machine learning, great data science is fundamental to having great feedback loops. So, there's lots of different techniques and tactics around how you might discover those feedback loops, and customers demand that you use their data to help them. So, we need to get faster, and streaming is one way, that's becoming feasible, and the advances in streaming and it's great Databricks are working on that, but the advances in streaming allows it to feed that loop, to take that much--those real-time signals, as well as previous signals, to really help figure out what you're trying to do today, what content-- interesting thing is, Netflix and Amazon were some pioneers in this space, where if you use Netflix service, often you go, "How the hell did they know "this video was going to be right for me?" And, some of the comments, and you can say, well, what they're actually doing is they're looking at microsegments, so previously everyone talked about custom segments as these very large groups, and they have their place, but increasing machine learning allows you to build microsegments. What I can start to do is actually discover from the behavior of others, things you likely-- very relevant things that you're going to be very interested in, and actually help inspire you and discover things you didn't even know existed. And by filling that gap and using those microsegments as well as put truly personal, personalization, I can bring that together to offer you a much more enhanced service. >> George: And so, help make that concrete in terms of, what would I as a potential--I want to plan a vacation for the summer, I have my five and a half inch or, five-seven iPhone, and that's my primary device. And in banking, it's moved from tying everything to the checking account, to tying every interaction to your mobile device. So what would you show me on my mobile device, that would get me really engaged about going to some location? >> So I think a lot of it is about where you are in that journey. So, you think, there's so many different routes customers can take, through that buying decision. And depends on the trip type, whether it's a leisure trip, seeing your family and friends, how much knowledge you may have about them, have you been there before? We look for all those signals, to try and help inspire. 
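One common way to get at the microsegments described here is to cluster travellers on a handful of behavioural signals and then attach content and offers to each small cluster, rather than to a few broad demographic segments. A rough sketch of that approach, with invented feature names and an arbitrary cluster count:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("microsegment-sketch").getOrCreate()

# Hypothetical per-traveller behavioural signals.
travellers = spark.read.parquet("s3://example-bucket/features/traveller_signals/")

pipeline = Pipeline(stages=[
    VectorAssembler(
        inputCols=["searches_last_30d", "avg_lead_time_days",
                   "share_mobile_sessions", "avg_nightly_rate_viewed"],  # invented signals
        outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    KMeans(k=200, seed=7, featuresCol="features", predictionCol="microsegment"),
])

model = pipeline.fit(travellers)
segmented = model.transform(travellers)

# Each traveller now carries a fine-grained segment id that downstream
# recommendation and offer logic can key on.
segmented.groupBy("microsegment").count().orderBy("count", ascending=False).show(10)
```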
So a great example might be, if you stayed in a hotel on our site before, and you liked that hotel, and you come back and do a search again, we try and make it easy to continue by putting that hotel at the top. Trying to make it easy to task-complete. We have a trip planner capability you'll see on the home screen, which allows you to record and play back some of your previous searches, so you can quickly see and compare where you've been, and what's interesting for you. But on top of that, we can then use the signals, and increasingly, we have a very advanced filter list, and that's a key, and we're looking in stuff, how we do conversations in the chatbox, is this sort of future, how to have a conversation to say, "Hey, here's a list of hotels, which we used a mix of your, "the types of preferences understood about you, "and the wider thing, where you are in the world, "what's going on, what time of day." We take hundreds of different signals to try and figure out what the right list is for you, and from that list, the great thing is most people interact with that list and give us more signals, exactly what you wanted. We can hone and hone and hone, and repeat, 'cause I said at the start, for example, those majority of customers will do multiple searches. They want to understand what the market is, they may not be interested in one particular place, they may have a sweeter place there instead. Even now, where we've moved further up the funnel, investing behind, how can you figure out what destination you're interested in? So you may not even know what destination you're interested in, or there might be other destinations that you didn't know--with a very relevant for your use case, particularly if you're going on vacation, we can help inspire you to find that hidden gem, that hidden great prize, you may not even know it existed. Being the much better job, but to show you how busy the market is, to how fast you should be looking to book there, if it's a very compressed, busy market, you need to get in there quick to lock your price in, and we're now providing that information to help you make a better decision. And we can mine all that data, to empower you to make smart decisions with smart data. >> I want to clarify something I saw in your demonstration this morning, you were talking about detecting the differences between photos and user-generated content, so do you have users actually posting their own photos of the hotel, right next to the photoshopped pictures of the hotel? >> Matt: We do, yeah. >> David: What are the ramifications of that? >> So it's an interesting advancement we've made, so we've... In the last of the year, we now offer and asking users to submit their photos, to help other users. I think one of the crucial things is about how to be authentic. Over the years, we've had tens of millions of testimonial reviews, text reviews, and we can see they're really, crucially important to users, and their buying decisions. >> David: It scares the hotel owners to death though, doesn't it? >> Matt: Well, I think it does, but I think the testimony of the customer, could be one of the key things we call them, as we have verified reviews, so to leave a review on our site, you've had to stay in that hotel. We think that's a crucial step in really helping to say, "These are your customers." In recent times, we've taken that product further, to now when you actually arrive at the hotel within a few hours, We'll ask you what your first impressions were. 
We would ask if you want to share that with the hotel owner. To get the hotel owner a chance to actually rectify any early challenges, so you can have a great stay. And one of the crucial things we have is that, what's really, really important, is that users and customers have a great stay, that reflects on our Net Promoter score, and their view of us, and we need to fill that cycle and make sure we have happy users. So that real-time review is super crucial, in basing how can hotels--if they want happy users and customers as well, it helps them to cut a course correct, if there's an issue, and we can step in as well to help the user if it's a really deep issue. And then with the photos, the key to think is how to navigate and understand what the photo is, so the user helps us by tagging that, which is great, but how we-- >> David: Possibly mistagging it. >> Possibly mistagging it on occasion, that's something we've, we've built in some skill as you've heard, on how to tackle that, but the crucial thing is how to bring these together, if you're on a mobile device, you've got to scan through each photo, and in places around the world have limited bandwidth, a limited time to go through them, so what we're now working on is how to assess the quality of those photos, to try and make sure we authentically--what we want to do, is get the customer the most lively experience they will have. As I said before, we're on the customer's kind of focus, we want to make sure they get the best photos, the most realistic of what's going to happen, and doing the most diverse. You want to see three photos, exactly the same, and we're working on the moment, you can swipe left and swipe right, we're working on how that display evolves over time, but it's exciting. >> David: Very exciting, fascinating stuff. Sorry that we're up against a hard break, coming here in just a moment, but I wanted to give you just 30 seconds to kind of sum up, maybe the next big technical challenge you're looking at that involves Spark, and we'll close with that. >> Cool, it's a great question. I think I talked a little about that in the keynote, totally caught the kind of out challenge. How to scale a mountain, which has been-- there's been great advance on how to stream data into platforms, Spark is a core part of that, and the platforms that we've been building, both internally, and partnering with Databricks and using their platform, has really given us a large boost going forwards, but how you turn those algorithms and that competitive algorithmic advantage, into a live production environment, whether it's marketplaces, Adtech marketplaces or websites, or in call centers, or in social media, wherever the platform needs to go, that's a hard problem right now. Or, I think it's too hard a problem right now. And I'd love to see--and we're going to invest behind that, a transformation, that hopefully this time next year, that is no longer a problem, and is actually an asset. >> David: Well I hope I'm not Captain Obvious to say, I know you're up to the challenge. Thank you so much, Matt Fryer, we appreciate you being on the show, thank you for sharing what's going on at Hotels.com. And thank you all for watching The Cube, we'll be back in a few moments with our next guest, here at Spark Summit 2017. (electronic music) (wind blowing)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
George Gilbert | PERSON | 0.99+ |
Matt Fryer | PERSON | 0.99+ |
David | PERSON | 0.99+ |
George | PERSON | 0.99+ |
David Goad | PERSON | 0.99+ |
Diller | PERSON | 0.99+ |
Matt | PERSON | 0.99+ |
Dallas | LOCATION | 0.99+ |
San Francisco | LOCATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Hotels.com | ORGANIZATION | 0.99+ |
Expedia | ORGANIZATION | 0.99+ |
Netflix | ORGANIZATION | 0.99+ |
five and a half inch | QUANTITY | 0.99+ |
ORGANIZATION | 0.99+ | |
30 seconds | QUANTITY | 0.99+ |
The Cube | TITLE | 0.99+ |
second revolution | QUANTITY | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Spark Summit 2017 | EVENT | 0.99+ |
next year | DATE | 0.99+ |
third | QUANTITY | 0.98+ |
both | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
iPhone | COMMERCIAL_ITEM | 0.98+ |
Spark Summit | EVENT | 0.98+ |
three photos | QUANTITY | 0.98+ |
Databricks | ORGANIZATION | 0.98+ |
each photo | QUANTITY | 0.97+ |
first impressions | QUANTITY | 0.97+ |
second time | QUANTITY | 0.96+ |
one way | QUANTITY | 0.93+ |
Power BI | TITLE | 0.92+ |
about 20 years ago | DATE | 0.91+ |
One | QUANTITY | 0.91+ |
tens of millions | QUANTITY | 0.91+ |
five-seven | QUANTITY | 0.9+ |
Spark | ORGANIZATION | 0.89+ |
today | DATE | 0.88+ |
last couple decades | DATE | 0.87+ |
this morning | DATE | 0.86+ |
one team | QUANTITY | 0.85+ |
#SparkSummit | EVENT | 0.8+ |
number of years ago | DATE | 0.77+ |
seconds ago | DATE | 0.77+ |
Tablo | TITLE | 0.76+ |
Captain Obvious | PERSON | 0.74+ |
signals | QUANTITY | 0.72+ |
first | QUANTITY | 0.7+ |
over | QUANTITY | 0.66+ |
#theCUBE | TITLE | 0.6+ |
Cube | COMMERCIAL_ITEM | 0.39+ |
Cube | TITLE | 0.37+ |
Wesley Kerr, Riot Games - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017. Brought to you by Databricks. >> Getting close to the end of the day here at Spark Summit, but we saved the best for last I think. I'm pretty sure about that. I'm David Goad, your host here on theCUBE and we now have data scientists from Riot Games, yes, Riot Games. His name is Wesley Kerr. Wesley, thanks for joining us. >> Thanks for having me. >> What's the best money-making game at Riot Games? >> Well we only have one game. We're known for League of Legends. It came out in 2009, it has been growing and well-received by our fans since then. >> And what's your role there? It says data scientist, but what do you really do? >> So we build models to look at things like in game behavior. We build models to actually help players engage with our store and buy our content. We look at different ways we can, just, improve our player experience. >> Alright well let's talk about a little more under the hood, here. How are you deploying Spark in the game? >> So we relied on Databricks for all of our deployment. We do many different clusters. We have about 14 data scientists that work with us, each one is sort of able to manage their own clusters: spin 'em up, tear 'em down, find their data that way and work with it through Databricks. >> So what else will you cover? You had a keynote session this morning, right? >> Yep. >> Give a recap for theCUBE audience of what you talked about. >> So we talked about our efforts in player behavior where we build models and deploy models that are watching chat between players so we evaluate whether or not players are being unsportsmanlike and come up with ways to, sort of, help them curb that behavior and be more sportsmanlike in our game. >> Oh wow, unsportsmanlike. How do you define that? It's if people are being abusive? >> So what we saw was there are about one or two percent of our games that is some form of serious abuse and that comes in term of hate speech, racism, sexism, things that have no place in the game and so we want them to realize that that language is bad and they shouldn't be using it. >> It's all key word driven or are there other behaviors or things that can indicate? >> So right now it's purely based on things said in chat, but we're currently investigating other, sort of, other ways of measuring that behavior and how it occurs in game and how it could influence what people are saying. >> Maybe like tweets coming from The White House? (laughing) >> Okay, so George. >> We should be able to measure that as well. >> So how about those warriors? (laughing) >> No, George did you want to talk a little bit more >> Sure. >> David: about the technical achievements here? When you look at like trying to measure engagement and sort of maybe it sounds like converting high engagement to store purchases, tell us a little more maybe how that works. >> So we look at, we want. Our game is completely free to play. Players can download, play it all the way through and we really try to create a very engaging game that they want to come back and they want to play and then everything they can buy in the store is actually just cosmetics. So we really hope to build content that our players love and are happy to spend money on. As far as... 
We just really want engagement to be from around players coming back and playing and having a good time and it's less about how to get that high engagement conversion into monetization as we've seen that players who are happy and loving the game are happy to spend their money. >> So tell us more about how you build some of these models like, you know, turning it into not turning it into Spark code, but how do you analyze it and, sort of, what's the database mechanism for, you know, 'cause the storage layer in Spark, you know, is just like the file system? >> Sure, yeah absolutely. So we are a world-wide game. We're played by over 100 million players around the world >> David: Wow. >> And so that data comes flowing in from all around the world into our centralized data warehouse. That data warehouse has gameplay data so we know how you did in game. It also has time series events, so things that occurred in each game. And our game is really session based so players can come play for an hour, that's one game, and then they leave and come back and play again. And so what we're able to do is then, sort of, look at those models and how they did. And I'll give you an example around our content recommendations. So we look at the champions that you've been playing recently to predict which champions you are likely to play next. And that we can actually just query the database, start building our collaborative filtering models on top of it, and then recommend champions that you may not play now, you may be interested in playing, or we may decide to give you a special discount on a champion if we think it'll resonate well with you. >> And in this case, just to be clear, the champions you're talking about are other players, not models? >> It's actually the in-game avatar. So it's the champion that they play. So we have 130 unique champions and each game you choose which champion you want to play and so then that plays out for like. It's much more like a sport than it is like a game. So it's five v five, online competitive. So there are different objectives on the map. You work with your team to complete those objectives and beat the other team. So we like to think of it like basketball, but with magic and in a virtual world. >> And the teams stay together? Or are they constantly recombining? >> They can disband, yeah. Your next game may find nine other people. If you're playing with your friends then you can just keep queuing up with them as well. So the champions that they control there happen to be who you're playing in that game. >> And when you are trying to anticipate champions that someone might play in the future, what are the variables that you're trying to guess and how long did it take you to build those models? >> Yeah, it's a good question. Right now we are able to sort of leverage the power of our user, our players, so we have 100 million. And so what we do and we have in our game there are roles so, for instance, like there's a center in basketball, we have a bot lane. So we have bottom lane support and bottom lane ADC. So a support character is there to make sure that your ADC is able to defeat the other team. And if you play a lot of support, odds are there are other players in the world who play a lot of support too so we find similar players. We find that if they engaged on the same sorts of champions that you play. For instance, I'm a Leona main and so I play her a lot. 
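Stepping back to the chat-moderation models from the keynote recap: flagging unsportsmanlike chat is, in machine-learning terms, text classification, and a first-pass version fits naturally into a Spark ML pipeline. The sketch below is a generic illustration of that shape, not Riot's actual system; the labelled table and its columns are assumed.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("chat-moderation-sketch").getOrCreate()

# Hypothetical labelled data: chat lines reviewed by humans, 1 = abusive, 0 = fine.
chat = spark.read.parquet("s3://example-bucket/chat/labelled/")

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="message", outputCol="tokens", pattern="\\W+"),
    HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(labelCol="is_abusive", featuresCol="features"),
])

train, test = chat.randomSplit([0.9, 0.1], seed=1)
model = pipeline.fit(train)

# Score a held-out slice; in production the same model would score live chat
# so that feedback can reach players close to the moment it happens.
scored = model.transform(test).select("message", "probability", "prediction")
scored.show(10, truncate=80)
```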
And if I were to look at what other people played in addition to Leona it could be things like Braum and so then we would recommend Braum as a champion that you should try out that you've maybe not played yet. >> David: Okay. >> So and then what's the data warehouse that you guys use for the ultimate repository of all this? >> All the data flows into a Hive data warehouse, stored in S3. We have two different ways of interacting with it. One, we can run queries against Hive. It tends to be a bit slower for our use cases. And then our data scientists tend to access that all that data through Databricks and Spark. And it runs much quicker for our use cases. >> Do you take what's in S3 and put it into a parquet format to accelerate? >> Sometimes, so we do some of those rewrites. We do a lot of our secondary ETLs where we're just joining across multiple tables and writing back out. We'll optimize those for our Spark use cases and there's writing back, sort of, read from S3, do some transformations, write back to S3. >> And how latency-sensitive is this? Are you guys trying to make decisions as the player moves along in his level or? >> So historically we've been batch. We do- our recommendations are updated weekly so we haven't needed a much higher cadence. But we're moving to a point where I want to see us be able to actually make recommendations on the client and do it immediately after you've finished a game with, say, Leona, here's an offer for Braum. Go check it out, give it a try in your next game. >> So Wesley what would you like to see developed that hasn't been developed yet that would really help in your business specifically? >> So one thing that's really exciting for gaming right now is procedural generation and artificial intelligence. So here there are a lot of opportunities, you've seen some collaborations between Deep Mind and Blizzard where they're learning to play Starcraft. For me, I think there's a similar world where we have a game that has different sorts of mechanics. So we have a large social piece to our game and teamwork is required. And so understanding how we can leverage that and help influence the future of artificial intelligence is something that I want to see us be able to do. >> Did you talk with anybody here at the Spark Summit about that? >> Anyone who would listen. (laughing) So we chatted some with the teams up at Blizzard and Twitch about some of the things they're doing for natural language as well. >> Alright so what was the most useful conversation you had here at the summit? >> The most useful one that I had, I think, was with the Databricks team. So at the end of my keynote, It was kind of serendipitous, I was talking about some work we had done with deep learning and sort of doing hyper parameter searches over our worker nodes, so actually being able to quickly try out many different models. And in the announcement that morning before my keynote, Tim talked about how they actually have deep learning pipelines now and it was based on a conversation we had had so I was very excited to see it come to fruition and now is open source and we can leverage it. >> Awesome, well, we're up against a hard break here. >> Wesley: Okay. >> We're almost at the end of the day. Wesley, it's been a riot talking to you. We really appreciate it and thank you for coming on the show and sharing your knowledge. >> Wesley: You bet, thanks for having me. >> Alright and that's it, we're going to wrap it up today. 
We have a wrap-up coming up, as a matter of fact, in just a few minutes. My name is David Goad. You're watching theCUBE at Spark Summit. (upbeat music)
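The champion recommendations Kerr describes are a classic implicit-feedback collaborative-filtering problem. A minimal PySpark sketch of that flow might look like the following; the table name, column names, and parameters are assumptions for illustration, not Riot's actual pipeline.

```python
# Hypothetical sketch: champion recommendations via collaborative filtering.
# Table and column names (champion_plays, account_id, champion_id, games_played)
# are assumed for illustration; ids are assumed to be integer-typed.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("champion-recs").getOrCreate()

# Implicit feedback: how often each player has played each champion recently.
plays = spark.table("champion_plays").select("account_id", "champion_id", "games_played")

als = ALS(
    userCol="account_id",
    itemCol="champion_id",
    ratingCol="games_played",
    implicitPrefs=True,        # play counts, not explicit ratings
    rank=20,
    regParam=0.1,
    coldStartStrategy="drop",
)
model = als.fit(plays)

# Top 5 champions to suggest (or discount) for each player.
recs = model.recommendForAllUsers(5)
recs.show(truncate=False)
```

The implicitPrefs flag matters here: play counts are a signal of preference rather than an explicit rating, which matches the "players who play a lot of support look like other support players" idea in the interview.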
Clarke Patterson, Confluent - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE. covering Spark Summit 2017, brought to you by Databricks. (techno music) >> Welcome to theCUBE, at Spark Summit here at San Francisco, at the Moscone Center West, and we're going to be competing with all the excitement happening behind us. They're going to be going off with raffles, and I don't know what all. But we'll just have to talk above them, right? >> Clarke: Well at least we didn't get to win. >> Our next guest here on the show is Clarke Patterson from Confluent. You're the Senior Director of Product Marketing, is that correct? >> Yeah, you got it. >> All right, well it's exciting -- >> Clarke: Pleasure to be here >> To have you on the show. >> Clarke: It's my first time here. >> David: First time on theCUBE? >> I feel like one of those radio people, first time caller, here I am. Yup, first time on theCUBE. >> Well, long time listener too, I hope. >> Clarke: Yes, I am. >> And so, have you announced anything new that you want to talk about from Confluent? >> Yeah, I mean not particularly at this show per se, but most recently, we've done a lot of stuff to enable customers to adopt Confluent in the Cloud. So we came up with a Confluent Cloud offering, which is a managed service of our Confluent platform a couple weeks ago, at our event around Kafka. So we're really excited about that. It really fits that need where Cloud First or operation-starved organizations are really wanting to do things with storing platforms based on Kafka, but they just don't have the means to make it happen. And so, we're now standing this up as a managed service center that allows them to get their hands on this great set of capabilities with us as the back stop to do things with it. >> And you said, Kafka is not just a publish and subscribe engine, right? >> Yeah, I'm glad that you asked that. So, that one of the big misconceptions, I think, of Kafka. You know, it's made its way into a lot of organizations from the early use case of publish and subscribe for data. But, over the last 12 to 18 months, in particular, there's been a lot of interesting advancements. Two things in particular: One is the ability to connect, which is called a Connect API in Kafka. And it essentially simplifies how you integrate large amounts of producers and consumers of data as information flows through. So, a modernization of ETL, if you will. The second thing is stream processing. So there's a Kafka streams API that's built-in now as well that allows you to do the lightweight transformations of data as it flows from point A to point B, and you could publish out new topics if you need to manipulate things. And it expands the overall capabilities of what Kafka can do. >> Okay, and I'm going to ask George here to dive in, if you could. >> And I was just going to ask you. >> David: I can feel it. (laughing) >> So, this is interesting. But if we want to frame this in terms of what people understand from, I don't want to say prehistoric eras, but earlier approaches to similar problems. So, let's say, in days gone by, you had an ETL solution. >> Clarke: Yup. >> So now, let's put Connect together with stream processing, and how does that change the whole architecture of integrating your systems? >> Yeah, I mean I think the easiest way to think about this is if you think about some of the different market segments that have existed over the last 10 to 20 years. 
So data integration was all about how do I get a lot of different systems to integrate a bunch of data and transform it in some manner, and ship it off to some other place in my business. And it was really good at building these end-to-end workflows, moving big quantities of data. But it was generally kind of batch-oriented. And so we've been fixated on, how do we make this process faster? To some degree, and the other segment is application integration which said, hey, you know when I want applications to talk to one another, it doesn't have the scale of information exchange, but it needs to happen a whole lot faster. So these real-time integration systems, ESBs, and things like that came along and it was able to serve that particular need. But as we move forward into this world that we're in now, where there's just all sorts of information, companies want to become advanced-centric. You need to be able to get the best of both of those worlds. And this is really where Kafka is starting to sit. It's saying, hey let's take massive amounts of data producers that need to connect to massive amounts of data consumers, be able to ship a super-granular level of information, transform it as you need, and do that in real-time so that everything can get served out very, very fast. >> But now that you, I mean that's a wonderful and kind of pithy kind of way to distill it. But now that we have this new way of thinking of app integration, data integration, best of both worlds, that has sort of second order consequences in terms of how we build applications and connect them. So what does that look like? What do applications look like in the old world and now what enables them to be sort of re-factored? Or for new apps, how do you build them differently? >> Yeah, I mean we see a lot of people that are going into microservices oriented architecture. So moving away from one big monolithic app that takes this inordinate amount of effort to change in some capacity. And quite frankly, it happens very, very slow. And so they look to microservices to be able to split those up into very small, functional components that they can integrate a whole lot faster, decouple engineering teams so we're not dependent on one another, and just make things happen a whole lot quicker than we could before. But obviously when you do that, you need something that can connect all those pieces, and Kafka's a great thing to sit in there as a way to exchange state across all these things. So that's a massive use case for us and for Kafka specifically in terms of what we're seeing people do. >> You've said something in there at the end that I want to key off, which is, "To exchange state." So in the old world, we used a massive shared database to share state for a monolithic app or sometimes between monolithic apps. So what sort of state-of-the-art way that that's done now with microservices, if there's more than one, how does that work? >> Yeah, I mean so this is kind of rooted in the way we do stream processing. So there's this concept of topics, which effectively could align to individual microservices. And you're able to make sure that the most recent state of any particular one is stored in the central repository of Kafka. But then given that we take an API approach to stream processing, it's easy to embed those types of capabilities in any of the end-points. And so some of the activity can happen on that particular front, then it all gets synchronized down into the centralized hub. >> Okay, let me unpack that a little bit. 
Because you take an API approach, that means that if you're manipulating a topic, you're processing a microservice and that has state in it? Is that the right way to think about it? >> I think that's the easiest way to think about it, yeah. >> Okay. So where are we? Is this a 10 year migration, or is it a, some certain class of apps will lend themselves well to microservices, legacy apps will stay monolithic, and some new apps, some new Greenfield apps, will still be database-centric? How do you, or how should customers think about that mix? >> Yeah that's a great question. I don't know that I have the answer to it. The best gauge I can have is just the amount of interest and conversations that we have on this particular topic. I will say that from one of the topics that we do engage with, it's easily one of the most popular that people are interested in. So if that's a data point, it's definitely a lot of interested people trying to figure out how to do this stuff very, very fast. >> How to do the microservices? >> Yeah and I think if you look at some of the more notable tech companies of late, they're architected this way from the start. And so everyone's kind of looking at the Netflix of the world, and the Ubers of the world saying, I want to be like those guys, how do I do that? And it's driving them down this path. So competitive pressure, I think, will help force people's hands. The more that your competitors are getting in front of you and are able to deliver a better customer experience through some sort of mobile app or something like that, then it's going to force people to have to make these changes quicker. But how long that takes it'll be interesting to see. >> Great! Great stuff. Switch gears just a little bit. Talk about maybe why you're using Databricks and what some of the key value you've gotten out of that. >> Yeah, so I wouldn't say that we're using Databricks per se, but we integrate directly with Spark. So if you look at a lot of the use cases that people use Spark for, they need to obviously get data to where it is. And some of the principles that I said before about Kafka generally, it's a very flexible, very dynamic mechanism for taking lots of sources of information, culling all that down into one centralized place and then distributing it to places such as Spark. So we see a lot of people using the technologies together to get the data from point A to point B, do some transformation as they so need, and then obviously do some amazing computing horsepower and whatnot in Spark itself. >> David: All right. >> I'm processing this, and it's tough because you can go in so many different directions, especially like the question about Spark. I guess, give us some of the scenarios where Spark would fit. Would it be like doing microservices that require more advanced analytics, and then they feed other topics, or feed consumers? And then where might you stick with a shared database that a couple services might communicate with, rather than maintaining the state within the microservice? >> I think, let me see if I can kind of unpack that myself a little bit. >> George: I know it was packed pretty hard. (laughing) >> Got a lot packed in there. When folks want to do things like, I guess when you think about it like an overall business process. If you think about something like an order to cash business process these days, it has a whole bunch of different systems that hang off it. It's got your order processing. You've got an inventory management. 
Maybe you've got some real-time pricing. You've got some shipments. Things, like that all just kind of hang off of the flow of data across there. Now with any given system that you use for addressing any answers to each of those problems could be vastly different. It could be Spark. It could be a relational database. It could be a whole bunch of different things. Where the centralization of data comes in for us is to be able to just kind of make sure that all those components can be communicating with each other based on the last thing that happened within each of them individually. And so their ability to embed transformation, data transformations and data processing in themselves and then publish back out any change that they had into the shared cluster subsequently makes that state available to everybody else. So that if necessary, they can react to it. So in a lot of ways, we're kind of agnostic to the type of processing that happens on the end-points. It's more just the free movement of all the data to all those things. And then if they have any relevant updates that need to make it back to any of the other components hanging on that process flow, they should have the ability to publish that back down it. >> And so one thing that Jay Kreps, Founder and CEO, talks about is that Kafka may ultimately, or in his language, will ultimately grow into something that rivals the relational database. Tell us what that world would look like. >> It would be controversial (laughing). >> George: That's okay. >> You want me to be the bad guy? So it's interesting because we did Kafka Summit about a month ago, and there's a lot of people, a lot of companies I should say, that are actually using and calling Kafka an enterprise data hub, a central hub for data, a data distribution network. And they are literally storing all sorts (raffle announcements beginning on loudspeaker) of different links of data. So one interesting example was the New York Times. So they used Kafka and literally stored every piece of content that has ever been generated at that publisher since the beginning of time in Kafka. So all the way back to 1851, they've obviously digitized everything. And it sits in there, and then they disposition that back out to various forms of the business. So that's -- >> They replay it, they pull it. They replay and pull, wow, okay. >> So that has some very interesting implications. So you can replay data. If you run some analytics on something and you didn't get the result that you wanted, and you wanted to redo it, it makes it really easy and really fast to be able to do that. If you want to bring on a new system that has some new functionality, you can do that really quickly because you have the full pedigree of everything that sits in there. And then imagine this world where you could actually start to ask questions on it directly. That's where it starts to get very, very profound, and it will be interesting to see where that goes. >> Two things then: First, it sounds, like a database takes updates, so you don't have a perfect historical record. You have a snapshot of current values. Like whereas in a log, like Kafka, or log-structured data structure you have every event that ever happened. >> Clarke: Correct. >> Now, what's the impact on performance if you want to pull, you know -- >> Clarke: That much data? >> Yeah. >> Yeah, I mean so it all comes down to managing the environment on which you run it. 
So obviously the more data you're going to store in here, and the more type of things you're going to try to connect to it, you're going to have to take that into account. >> And you mentioned just a moment ago about directly asking about the data contained in the hub, in the data hub. >> Clarke: Correct. >> How would that work? >> Not quite sure today, to be honest with you. And I think this is where that question, I think, is a pretty provocative one. Like what does it mean to have this entire view of all granular event streams, not in some aggregated form over time? I think the key will be some mechanism to come onto an environment like this to make it more consumable for more business types users. And that's probably one of the areas we'll want to watch to see how that's (background noise drowns out speaker). >> Okay, only one unanswered question. But you answered all the other ones really well. So we're going to wrap it up here. We're up against a loud break right now. I want to think Clarke Patterson from Confluent for joining us. Thank you so much for being on the show. >> Clarke: Thank you for having me. >> Appreciate it so much. And thank you for watching theCUBE. We'll be back after the raffle in just a few minutes. We have one more guest. Stay with us, thank you. (techno music)
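The New York Times pattern Patterson describes, keeping the full event log and replaying it into new systems, comes down to starting a fresh consumer from the earliest offset. A rough Python sketch using the confluent-kafka client follows; the broker, topic, and group names are assumptions for illustration.

```python
# Hypothetical sketch: replay an entire topic from the beginning to backfill a new
# downstream system. Broker, topic, and group names are assumptions.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "new-search-index-backfill",   # a fresh group id has no committed offsets
    "auto.offset.reset": "earliest",           # so consumption begins at the oldest record
    "enable.auto.commit": False,
})
consumer.subscribe(["published-content"])

replayed = 0
while True:
    msg = consumer.poll(timeout=10.0)
    if msg is None:          # nothing new within the timeout; assume we are caught up
        break
    if msg.error():
        continue
    # index_document(msg.key(), msg.value())   # hypothetical hook: feed the new system
    replayed += 1

consumer.close()
print(f"replayed {replayed} records")
```

Using a brand-new consumer group is the key design choice: it lets a new system read the full pedigree of events without disturbing the offsets of anything already in production.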
Ali Ghodsi, Databricks - #SparkSummit - #theCUBE
>> Narrator: Live from San Francisco, it's the Cube. Covering Sparks Summit 2017. Brought to you by Databricks. (upbeat music) >> Welcome back to the Cube, day two at Sparks Summit. It's very exciting. I can't wait to talk to this gentleman. We have the CEO from Databricks, Ali Ghodsi, joining us. Ali, welcome to the show. >> Thank you so much. >> David: Well we sat here and watched the keynote this morning with Databricks and you delivered. Some big announcements. Before we get into some of that, I want to ask you, it's been about a year and a half since you transitioned from VP Products and Engineering into a CEO role. What's the most fun part of that and maybe what's the toughest part? >> Oh, I see. That's a good question and that's a tough question too. Most fun part is... You know, you touch many more facets of the business. So in engineering, it's all the tech and you're dealing only with engineers, mostly. Customers are one hop away, there's a product management layer between you and the customers. So you're very inwards focused. As a CEO you're dealing with marketing, finance, sales, these different functions. And then, externally with media, with stakeholders, a lot of customer calls. There's many, many more facets of the business that you're seeing. And it also gives you a preview and it also gives you a perspective that you couldn't have before. You see how the pieces fit together so you actually can have a better perspective and see further out than you could before. Before, I was more in my own pick situation where I was seeing sort of just the things relating to engineering so that's the best part. >> You're obviously working close with customers. You introduced a few customers this morning up on stage. But after the keynote, did you hear any reactions from people? What are they saying? >> Yes the keynote was recently so on my way here I've had multiple people sort of... A couple people that high-fived just before I got up on stage here. On several softwaring, people are really excited about that. Less devops, less configuration, let them focus on the innovation, they want that. So that's something that's celebrated. Yesterday-- >> Recap that real quickly for our audience here, what the server-less operating is. >> Absolutely, so it's very simple. We want lots of data scientists to be able to do machine learning without have to worry about the infrastructure underneath it. So we have something called server-less pools and server-less pools you can just have lots of data scientists use it. Under the hood, this pool of resources shrinks and expands automatically. It adds storage, if needed. And you don't have to worry about the configuration of it. And it also makes sure that it's isolating the different data scientists. So if one data scientist happened to run something that takes much more resources, it won't effect the other data scientists that are sharing that. So the short story of it is you cut costs significantly, you can now have 3000 people share the same resources and it enables them to move faster because they don't have to worry about all the devops that they otherwise have to do. >> George, is that a really big deal? >> Well we know whenever there's infrastructure that gets between a developer, data science, and their outcomes, that's friction. I'd be curious to say let's put that into a bigger perspective, which is if you go back several years, what were the class of apps that Spark was being used for, and in conjunction with what other technologies. 
Then bring us forward to today and then maybe look out three years. >> Ali: Yeah, that's a great question. So from the very beginning, data is key for any of these predictive analytics that we are doing. So that was always a key thing. But back then we saw more Hadoop data lakes. There more data lakes, data reservoirs, data marks that people were building out. We saw also a lot of traditional data warehousing. These days, we see more and more things moving to cloud. The Hadoop data lake received, often times at enterprises, being transformed into a cloud blob storage. That's cheaper, it's dual-up replicated, it's on many continents. That's something that we've seen happen. And we work across any of these, frankly. We, from the very beginning, Spark, one of its strengths is it integrates really well wherever your data is. And there's a huge community of developers around it, over 1000 people now that have contributed to it. Many of these people are in other organizations, they're employed by other companies and their job is to make sure that Databricks or Spark works really, really well with, say, Cassandra or with S3. That's a shift that we're seeing. In terms of applications people are building it's moving more into production. Four years ago much more of it was interactive exploratory. Now we're seeing production use cases. The fraud analytics use case that I mentioned, that's running continuously and the requirements there are different. You can't go down for ten minutes on a Saturday morning at 4 a.m. when you're doing credit card fraud because that's a lot of fraud and that affects the business of, say, Capital One. So that's much more crucial for them. >> So what would be the surrounding infrastructure and applications to make that whole solution work? Would you plug into a traditional system of record at the sales order entry kind of process point? Are you working off sort of semi-real-time or near real-time data? And did you train the models on the data lake? How did the pieces fit together? >> Unfortunately the answers depends on the particular architecture that the customer has. Every enterprise is slightly different. But it's not uncommon that the data is coming in. They're using Spark structured streaming in Databricks to get it into S3, so that's one piece of the puzzle. Then when it ends up there, from then on it funnels out to many different use cases. It could be a data warehousing use case, where they're just using interactive sequel on it. So that's the traditional interactive use case, but it could be a real-time use case, where it's actually taking those data that it's processed and it's detecting anomalies and putting triggers in other systems and then those systems downstream will react to those triggers for anomalies. But it could also be that it's periodically training models and storing the models somewhere. Often times it might be in a Cassandra, or in a Redis, or something of that sort. It will store the model there and then some web application can then take it from there, do point queries to it and say okay, I have a particular user that came in here George now, quickly look up what is his feature vector, figure out what the product recommendations we should show to this person and then it takes it from there. >> So in those cases, Cassandra or Redis, they're playing the serving layer. But generating the prediction model is coming from you and they're just doing the inferencing, the prediction itself. 
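The ingest step Ghodsi mentions, Spark Structured Streaming landing events in S3 before they fan out to warehousing, alerting, and model training, can be sketched in a few lines of PySpark. The broker, topic, and S3 paths below are assumptions for illustration, not any customer's actual configuration, and the Kafka source assumes the spark-sql-kafka connector is on the classpath.

```python
# Hypothetical sketch: Structured Streaming reading events from Kafka and landing
# them in S3 as Parquet. Broker, topic, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ingest-to-s3").getOrCreate()

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "transactions")
         .load()
         .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

query = (
    events.writeStream
          .format("parquet")
          .option("path", "s3a://datalake/raw/transactions/")
          .option("checkpointLocation", "s3a://datalake/checkpoints/transactions/")
          .start()
)
query.awaitTermination()
```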
So if you look out several years, without asking you the roadmap, which you can feel free to answer, how do you see that scope of apps expanding or the share of an existing app like that? >> Yeah, I think two interesting trends that I believe in, I'll be foolish enough to make predictions. One is that I think that data warehousing, as we know it today, will continue to exist. However, it will be transformed, and all the data warehousing solutions that we have today will add predictive capabilities or they will disappear. So let me motivate that. If you have a data warehouse with customer data in it and a fact table, you have all your transactions there, you have all your products there. Today, you can plug in BI tools and on top of that you can see what's my business health today and yesterday. But you can't ask it: tell me about tomorrow. Why not? The data is there, why can I not ask it, this customer data, you tell me which of these customers are going to churn, or which one of them should I reach out to because I can possibly upsell these? Why wouldn't I want to do that? I think everyone would want to do that, and every data warehousing solution in ten years will have these capabilities. Now with Spark SQL you can do that, and the announcement yesterday showed you also how you can bake models, machine learning models, and export them so a SQL analyst can just access them directly with no machine learning experience. It's just a simple function call and it just works. So that's one prediction I'll make. The second prediction I'll make is that we're going to see lots of revolutions in different industries, beyond the traditional 'get people to click on ads' and understand social behavior. We're going to go beyond that. So for those use cases it will be closer to the things I mentioned like Shell, and what you need to do there is involve these domain experts. The domain experts will come in, the doctors, or the machine specialists, you have to involve them in the loop. And they'll be able to transform maybe much less exotic applications, it's not the super high-tech Silicon Valley stuff, but it's nevertheless extremely important to every enterprise, to every vertical, on the planet. That's, I think, the exciting part of where predictions will go in the next decade or two. >> If I were to try and pick out the most man-bites-dog kind of observation in there, you know, it's supposed to be the unexpected thing, I would say where you said all data warehouses are going to become predictive services. Because what we've been hearing, it's sort of the other side of that coin, which is all the operational databases will get all the predictive capabilities. But you said something very different. I guess my question is, are you seeing the advanced analytics going to the data warehouse because the repository of data is going to be bigger there, and so you can either build better models, or because it's not burdened with transaction SLAs that you can serve up predictions quicker? >> Data warehousing has been about basic statistics. It's been SQL; that's the language that is used to get descriptive statistics. Tables with averages and medians, that's statistics. Why wouldn't you want to have advanced statistics which now does predictions on it? It just so happens that SQL is not the right interface for that. So it's going to be very natural that people who are already asking statistical questions, for the last 30 years, of their customer data, these massive troves of data that they have stored.
Why wouldn't they want to also say, 'okay now give me more advanced statistics?' I'm not an expert on advanced statistics but you the system. Tell me what I should watch out for. Which of these customers do I talk to? Which of the products are in trouble? Which of the products are not, or which parts of my business are not doing well now? Predict the future for me. >> George: When you're doing that though, you're now doing it on data that has a fair amount of latency built into it. Because that's how it got into the data warehouse. Where if it's in the operational database, it's really low latency, typically low latency stuff. Where and why do you see that distinction? >> I do think also that we'll see more and more real-time engines take over. If you do things in real-time you can do it for a fraction of the cost. So we'll also see those capabilities come in. So you don't have to... Your question is, why would you want to once a week batch everything into a central warehouse and I agree with that. It will be streaming in live and then you can on that, do predictions, you can do basic analytics. I think basically the lines will blur between all these technologies that we're seeing. In some sense, Spark actually was the precursor to all that. So Spark already was unifying machine learning, sequel, ETL, real-time, and you're going to see that everywhere appear. >> You mentioned Shell as an example, one of your customers, you also had HP, Capital One, and you developed this unified analytics platform, that's solving some of their common problems. Now that you're in the mood to make predictions, what do you think are going to be the most compelling use cases or industries where you're going to see Databricks going in the future? >> That's a hard one. Right now, I think healthcare. There's a lot of data sets, there's a lot of gene sequencing data. They want to be able to use machine learning. In fact, I think those industries being transformed slowly from using classical statistics into machine learning. We've actually helped some of these companies do that. We've set up workshops and they've gotten people trained. And now they're hiring machine learning experts that are coming in. So that's one I think in the healthcare industry, whether it's for drug-testing, clinical-trials, even diagnosis, that's a big one, I do think industrial IT. These are big companies with lots of equipment, they have tons of sensor data, massive data sets. There's a lot of predictions that they can do on that. So that's a second one I would say. Financial industry, they've always been about predictions, so it makes a lot of sense that they continue doing that. Those are the biggest ones for Databricks. But I think now also as slowly, other verticals are moving into the cloud. We'll see more of other use cases as well. But those are the biggest ones I see right now. It's hard to say where it will be ten years from now, or 15. Things are going so fast that it's hard to even predict six months. >> David: Do you believe IOT is going to be a big business driver? >> Yes, absolutely. >> I want to circle back where you said that we've got different types of databases but we're going to unify the capabilities. Without saying, it's not like one wins, one loses. >> Ali: Yes, I didn't want to do that. >> So describe maybe the characteristics of what a database that compliments Sparks really well might look like. >> That's hard for me to say. The capabilities of Spark, I think, are here to stay. 
The ability to be able to ETL variety of data that doesn't have structure, so Structured Query Language, SQL, is not fit for it, that is really important and it's going to become more important since data is the new oil, as they said. Well, then it's going to be very important to be able to work with all kinds of data and getting that into the systems. There's more things everyday being created. Devices, IOT, whatever it is that are spewing out this data in different forms and shapes. So being able to work with that variety, that's going to be an important property. So they'll have to do that. That's the ETL portion or the ELT portion. The real-time portion, not having to do this in a batch manner once a week because now time is a competitive advantage. So if I'm one week behind you that means I'm going to lose out. So doing that in real-time, or near human-time or human real-time, that's going to be really important. So that's going to come as well, I think, and people will demand that. That's going to be a competitive advantage. Wherever you can add that secret sauce it's going to add value to the customers. And then finally the predictive stuff, adding the predictive stuff. But I think people will want to continue to also do all the old stuff they've been doing. I don't think that's going to go away. Those bring value to customers, they want to do all those traditional use cases as well. >> So what about now where customers expect to have some, not clear how much, un-Primmed application platform like Spark. Some in the cloud that now that you've totally reordered the TCO equation. But then also at the edge for IOT-type use cases, do you have to slim down Spark to work at the edge? If you have server-less working in the cloud, does that mean you have to change the management paradigm on Prim. What does that mix look like? How does someone, you know how does a Fortune 200 company, get their arms around that? >> Ali: Yeah, this is a surprising thing, most surprising thing for me in the last year, is how many of those Fortune 200's that I was talking to three years ago and they were saying 'no way, we're not going into the cloud. You don't understand the regulations that we are facing or the amount of data that we have.' Or 'we can do it better,' or 'the security requirements that we have, no one can match that.' To now, those very same companies are saying 'absolutely, we're going.' It's not about if, it's about when. Now I would be hard-pressed to find any enterprise that says 'no, we're not going to go, ever.' And some companies we've even seen go from the cloud to on Prim, and then now back. Because the prices are getting more competitive in the cloud. Because now there's three, at least, major players that are competing and they're well-funded companies. In some sense, you have ad money and office money and retail money being thrown at this problem. Prices are getting competitive. Very soon, most IT folks will realize, there's no way we can do this faster, or better, or more reliable secure ourselves. >> David: We've got just a minute to go here before the break so we're going to kind of wrap it up here. And we got over 3000 people here at Spark Summit so it's the Spark community. I want you to talk to them for a moment. What problems do you want them to work on the most? And what are we going to be talking about a year from now at this table? >> The second one is harder. So I think the Spark community is doing a phenomenal job. I'm not going to tell them what to do. 
They should continue doing what they are doing already, which is integrating Spark in the ecosystem, adding more and more integrations with the greatest technologies that are happening out there. Continue the innovation, and we're super happy to have them here. We'll continue it as well, we'll continue to host this event and look forward to also having a Spark Summit in Europe, and also the East Coast soon. >> David: Okay, so I'm not going to ask you to make any more predictions. >> Alright, excellent. >> David: Ali, this is great stuff today. Thank you so much for taking some time and giving us more insight after the keynote this morning. Good luck with the rest of the show. >> Thank you. >> Thanks, Ali. And thank you for watching. That's Ali Ghodsi, CEO of Databricks. We are at Spark Summit 2017 here, on the Cube. Thanks for watching, stay with us. (upbeat music)
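One way to picture the "simple function call" idea from this conversation is a trained model registered behind a Spark SQL UDF, so an analyst can invoke it from an ordinary query. This is an illustrative sketch only; the model, features, table, and function names are assumptions, and it is not necessarily the export mechanism that was announced at the summit.

```python
# Hypothetical sketch: wrapping a trained model so a SQL analyst can call it as a
# function. Model, features, and names are assumptions for illustration.
import pickle
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("predictive-sql").getOrCreate()

# Assume churn_model is a scikit-learn-style object loaded from storage, picklable,
# with a predict_proba method over two numeric features.
with open("/models/churn_model.pkl", "rb") as f:
    churn_model = pickle.load(f)

def churn_score(monthly_spend, support_tickets):
    # Probability of the positive (churn) class for one customer.
    return float(churn_model.predict_proba([[monthly_spend, support_tickets]])[0][1])

spark.udf.register("churn_score", churn_score, DoubleType())

# An analyst can now ask a predictive question in plain SQL, assuming a registered
# customers table with these columns.
spark.sql("""
    SELECT customer_id, churn_score(monthly_spend, support_tickets) AS churn_risk
    FROM customers
    ORDER BY churn_risk DESC
    LIMIT 20
""").show()
```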
John Cavanaugh, HP - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCube, covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCube at Spark Summit 2017. I don't know about you, George, I'm having a great time learning from all of our attendees. >> We've been absorbing now for almost two days. >> Yeah, well, and we're about to absorb a little bit more here, too, because the next guest, I looking forward to, I saw his name on the schedule, all right, that's the guy who talks about herding cats, it's John Cavanaugh, Master Architect from HP. John, welcome to the show. >> Great, thanks for being here. >> Well, I did see, I don't know if it's about cats in the Internet, but either cats or self-driving cars, one of the two in analogies. But talk to us about your session. Why did you call it Herding Cats, and is that related to maybe the organization at HP? >> Yeah, there's a lot of organizational dynamics as part of our migration at Spark. HP is a very distributed organization, and it has had a lot of distributed autonomy, so, you know, trying to get centralized activity is often a little challenging. You guys have often heard, you know, I am from the government, I'm here to help. That's often the kind of shields-up response you will get from folks, so we got a lot of dynamics in terms of trying to bring these distributed organizations on board to a new common platform, and a allay many of the fears that they had with making any kind of a change. >> So, are you centered at a specific division? >> So, yes, I'm the print platforms and future technology group. You know, there's two large business segments with HP. There's our personal systems group that produces everything from phones to business PCs to high-end gaming. But I'm in the printing group, and while many people are very familiar with your standard desktop printer, you know, the printers we sell really vary from a very small product we call Sprocket, it fits in your hand, battery-operated, to literally a web press that's bigger than your house and prints at hundreds of feet per minute. So, it's a very wide product line, and it has a lot of data collection. >> David: Do you have 3D printing as well? >> We do have 3D printing as well. That's an emergent area for us. I'm not super familiar with that. I'm mostly on the 2D side, but that's a very exciting space as well. >> So tell me about what kind of projects that you're working on that do require that kind of cross-team or cross-departmental cooperation. >> So, you know, in my talk, I talked about the Wild West Era of Big Data, and that was prior to 2015, and we had a lot of groups that were standing up all kinds of different big data infrastructures. And part of this stems from the fact that we were part of HP at the time, and we could buy servers and racks of servers at cost. Storage was cheap, all these things, so they sprouted up everywhere. And, around 2015, everybody started realizing, oh my God, this is completely fragmented. How do we pull things back together? And that's when a lot of groups started trying to develop platformish types of activities, and that's where we knew we needed to go, but there was even some disagreement from different groups, how do we move forward. So, there's been a lot of good work within HP in terms of creating a virtual community, and Spark really kind of caught on pretty quickly. Many people were really tired of kind of Hadoop. There were a lot of very opinionated models in Hadoop, where Spark opens up a lot more into the data science community. 
So, that went really well, and we made a big push into AWS for much of our cloud activities, and we really ended up then pretty quickly with Databricks as an enterprise partner for us. >> And so, George, you've done a lot of research. I'm sure you talked to enterprise companies along the way. Is this a common issue with big enterprises? >> Well, for most big data projects they've started, the ones we hear a lot about is there's a mandate from the CIO, we need a big data strategy, and so some of those, in the past, stand up five or 10-node Hadoop cluster and run some sort of pilot and say, this is our strategy. But is sounds like you herded a lot of cats... >> We had dozens of those small Hadoop clusters all around the company. (laughter) >> So, how did you go about converting that energy, that excess energy towards something more harmonized around Databricks? >> Well, a lot of people started recognizing we had a problem, and this really wasn't going to scale, and we really needed to come up with a broader way to share things across the organization. So, the timing was really right, and a lot of people were beginning to understand that. And, you know, we said for us, probably about five different kind of key decisions we ended up making. And part of the whole strategy was to empower the businesses. As I have mentioned, we are a very distributed organization, so, you can't really dictate the businesses. The businesses really need the owners' success. And one of the decisions that was made, it might be kind of controversial for many CIOs, is that we've made a big push on cloud-hosted and business-owned, not IT-owned. And one of the real big reasons for that is we were no longer viewing data and big data as kind of a business-intelligence activity or a standardized reporting activity. We really knew that, to be successful moving forward, is needed to be built into our products and services, and those products and services are managed by the businesses. So, it can't be something that would be tossed off to an IT organization. >> So that the IT organization, then, evolved into being more of an innovative entity versus a reactive or supportive entity for all those different distributing groups. >> Well, in our regard, we've ended up with AWS as part of our activity, and, really, much of our big data activities are driven by the businesses. The connections we have with IT are more related to CRM and product data master sheets and selling in channels and all that information. >> But if you take a bunch of business-led projects and then try and centralize some aspect of them, wouldn't IT typically become the sort of shared infrastructure architecture advisor for that, and then the businesses now have a harmonized platform on which they can build shared data sets? >> Actually, in our case, that's what we did. We had a lot of our businesses that already had significant services hosted in AWS. And those were very much part of the high-data generators. So, it became a very natural evolution to continue with some of our AWS relationships and continue on to Databricks. So, as an organization today, we have three kind of main buckets for our Databricks, but, you know, any business, they can get their accounts. We try and encourage everything to get into a data link, and that's three, and Parquet formats, one of the decisions that was adapted. And then, from there, people can begin to move. You know, you can get notebooks, you can share notebooks, you can look at those things. 
You know, the beauty of Databricks and AWS is instant on. If I want to play around with something with a half a dozen nodes, it's great. If I need a thousand for a workload, boom, I've got it! I know, kind of others, then, with this cost and the value returned, there's really no need for permissions or coordination with other entities, and that's kind of what we wanted the businesses to have that autonomy to drive their business success. >> But, does there not to be some central value added in the way of, say, data curation through a catalog or something like that? >> Yes, so, this is not necessarily a model where all the businesses are doing all kinds of crazy things. One of the things that we shepherded by one of our CTOs and the other functions, we ended up creating a virtual community within HP. This kind of started off with a lot of "tribal elders" or "tribal leaders." With this virtual community, today we get together every two weeks, and we have presentations and discussions on all things from data science into machine learning, and that's where a lot of this activity around how do we get better at sharing. And this is fostered, kind of splinters off for additional activity. So we have one on data telemetry within our organization. We're trying to standardize more data formats and schemas for those so we can have more broader sharing. So, these things have been occurring more organically as part of a developer enablement kind of moving up rather than more of kind of dictates moving down. >> That's interesting. Potentially, really important, when you say, you're trying to standardize some of the telemetry, what are you instrumenting. Is it just all the infrastructure or is it some of the products that HP makes? >> It's definitely the products and the software. You know, like I said, we manage a huge spectrum of print products, and my apologies if I'm focusing on it, but that is what I know the best. You know, we've actually been doing telemetry and analysis since the late 90s. You know, we wanted to understand use of supplies and usage so we could do our own forecasting, and that's really, really grown over the years. You know, now, we have parts of our services organization management services, where they're offering big data analytics as part of the package, and we provide information about predictive failure of parts. And that's going to be really valuable for some of our business partners that allows them. We have all kinds fancy algorithms that we work on. The customers have specific routes that they go for servicing, and we may be able to tell them, hey, in a certain time period, we think these devices in your field so you can coordinate your route to hit those on an efficient route rather than having to make a single truck roll for one repair, and do that before a customer experiences a problem. So, it's been kind of a great example of different ways that big data can impact the business. >> You know, I think Ali mentioned in the keynote this morning about the example of a customer getting a notification that their ink's going to run out, and the chance that you get to touch that customer and get them to respond and buy, you could make millions of dollar difference, right? Let's talk about some of the business outcomes and the impact that some of your workers have done, and what it means, really, to the business. >> Right now, we're trying to migrate a lot of legacy stuff, and you know, that's kind of boring. 
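Cavanaugh's point about standardizing telemetry formats and schemas is, in practice, about agreeing on a shared event definition up front. As a hedged illustration, a common schema could be declared explicitly in PySpark so every team parses the same fields the same way; the field names and path below are assumptions, not HP's actual format.

```python
# Hypothetical sketch: an explicitly declared, shared schema for device telemetry so
# different teams parse the same fields consistently. Field names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType, MapType)

spark = SparkSession.builder.appName("telemetry").getOrCreate()

telemetry_schema = StructType([
    StructField("device_id", StringType(), nullable=False),
    StructField("product_line", StringType(), nullable=False),   # e.g. desktop printer vs. web press
    StructField("event_type", StringType(), nullable=False),     # e.g. supply_level, error, usage
    StructField("event_time", TimestampType(), nullable=False),
    StructField("metrics", MapType(StringType(), DoubleType()), nullable=True),
])

# Every business unit reads its raw feed against the same agreed schema.
events = spark.read.schema(telemetry_schema).json("s3a://telemetry-lake/raw/print/")
events.printSchema()
```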
(laughs) It's just a lot of work, but there are things that need to happen. But there's really the power of the big data platform has been really great with Databricks. I know, John Landry, one of our CTOs, he's in the personal systems group. He had a great example on some problems they had with batteries and laptops, and, you know, they have a whole bunch of analytics. They've been monitoring batteries, and they found a collection of batteries that experienced very early failure rates. I happen to be able to narrow it down to specific lots from a specific supplier, and they were able to reach out to customers to get those batteries replaced before they died. >> So, a mini-recall instead of a massive PR failure. (laughs) >> You know, it was really focused on, you know, customers didn't even know they were going to have a problem with these batteries, that they were going to die early. You know, you got to them ahead of time, told them we knew this was going to be a problem and try to help them. I mean, what a great experience for a customer. (laughs) That's just great. >> So, once you had this telemetry, and it sounds like a bunch of shared repositories, not one intergalactic one. What were some of the other use cases like, you know, like the battery predictive failure type scenarios. >> So, you know, we have some very large gaps, or not gaps, with different categories. We have clearly consumer products. You know, you sell millions and millions of those, and we have little bit of telemetry with those. I think we want to understand failures and ink levels and some of these other things. But, on our commercial web presses, these very large devices, these are very sensitive. These things are down, they have a big problem. So, these things are generating all kinds of data. All right, we have systems on a premise with customers that are alerting them to potential failures, and there's more and more activity going on there to understand predictive failure and predictive kind of tolerance slippages. I'm not super familiar with that business, but I know some guys that they've started introducing more sensors into products, specifically so they can get more data, to understand things. You know, slight variations in tensioning and paper, you know, these things that are running hundreds of feet per minute can have a large impact. So, I think that's really where we see more and more of the value coming from is being able to return that value back to the customer, not just help us make better decisions, but to get that back to the customer. You know, we're talking about expanding more customer-facing analytics in these cases, or we'll expose to customers some of the raw data, and they can build their own dashboards. Some of these industries have traditionally been very analog, so this move to digital web process and this mountain of data is a little new for them, but HP can bring a lot to the table in terms of our experience in computing and big data to help them with their businesses. >> All right, great stuff. And we just got a minute to go before we're done. I have two questions for you, the first is an easy yes/no question. >> John: Okay. >> Is Purdue going to repeat as Big 10 champ in basketball? >> Oh, you know, I don't know. (laughs) I hope so! >> We both went to Purdue. >> I'm more focused on the Warriors winning. (laughter) >> All right, go Warriors! And, the real question is, what surprised you the most? This is your first Spark Summit. What surprised you the most about the event? 
>> So, you know, you see a lot of Internet-born companies, and it's amazing how many people have just gone fully native with Spark all over the place, and it's a beautiful thing to see. You know, in larger enterprises, that transition doesn't happen like that. I'm kind of jealous. (laughter) We have a lot more things to slog through, but the excitement here and all the things that people are working on, you know, you can only see so many tracks. I'm going to have to spend two days when I get back just watching the videos on all of the tracks I couldn't attend. >> All right, Internet-born companies versus the big enterprise. Good luck herding those cats, and thank you for sharing your story with us today and talking a little bit about the culture there at HP. >> John: Thank you very much. >> And thank you all for watching this segment of theCUBE. Stay with us, we're still covering Spark Summit 2017. This is Day Two, and we're not done yet. We'll see you in a few minutes. (theCUBE jingle)
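As a brief aside, here is a minimal PySpark sketch of the kind of lot-level battery screening John describes above: aggregate fleet telemetry by supplier lot and flag lots with unusually high early-failure rates. The column names, the one-year threshold, and the inline sample rows are assumptions for illustration only, not HP's actual telemetry schema or pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("battery-early-failure").getOrCreate()

# In practice this would be read from the shared telemetry tables
# (e.g. spark.read.parquet(...)); a few inline rows keep the sketch runnable.
telemetry = spark.createDataFrame(
    [
        ("lotA", "supplier1", 210, 1),   # days_in_service, failed flag
        ("lotA", "supplier1", 190, 1),
        ("lotA", "supplier1", 250, 1),
        ("lotB", "supplier2", 380, 0),
        ("lotB", "supplier2", 500, 0),
        ("lotB", "supplier2", 420, 0),
    ],
    ["lot_id", "supplier", "days_in_service", "failed"],
)

# Treat a failure inside the first year of service as an "early" failure.
early = F.when((F.col("failed") == 1) & (F.col("days_in_service") < 365), 1).otherwise(0)

lot_stats = (
    telemetry
    .withColumn("early_failure", early)
    .groupBy("lot_id", "supplier")
    .agg(
        F.count(F.lit(1)).alias("units"),
        F.avg("early_failure").alias("early_failure_rate"),
    )
)

# Flag lots whose early-failure rate is well above the fleet-wide average,
# so they can be followed up on before more units die in the field.
fleet_rate = lot_stats.agg(F.avg("early_failure_rate")).first()[0]
suspect_lots = lot_stats.filter(F.col("early_failure_rate") > 1.5 * fleet_rate)
suspect_lots.show()
```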
Mark Grover & Jennifer Wu | Spark Summit 2017
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Hi, we're back here where theCUBE is live, and I didn't even know it. Welcome, we're at Spark Summit 2017. Having so much fun talking to our guests, I didn't know the camera was on. We are doing a talk with Cloudera, a couple of experts that we have here. First is Mark Grover, who's a software engineer and an author. He wrote the book "Hadoop Application Architectures." Mark, welcome to the show. >> Mark: Thank you very much. Glad to be here. >> And just to his left we also have Jennifer Wu, and Jennifer's director of product management at Cloudera. Did I get that right? >> That's right. I'm happy to be here, too. >> Alright, great to have you. Why don't we get started talking a little bit more about what Cloudera is introducing that's new at the show? I saw a booth over here. Mark, do you want to get started? >> Mark: Yeah, there are two exciting things that we've launched at least recently. There's Cloudera Altus, which is for transient workloads and being able to do ETL-like workloads, and Jennifer will be happy to talk more about that. And then there's Cloudera Data Science Workbench, which is this tool that allows folks to use data science at scale. So, get away from doing data science in silos on your personal laptops, and do it in a secure environment in the cloud. >> Alright, well, let's jump into Data Science Workbench first. Tell me a little bit more about that — you mentioned it's for exploratory data science. So give us a little more detail on what it does. >> Yeah, absolutely. So, there was a private beta for Cloudera Data Science Workbench earlier in the year and then it went GA a few months ago. And it's, like you said, an exploratory data science tool that brings data science to the masses within an enterprise. Previously there was this dichotomy, right? As a data scientist, I want to have the latest and greatest tools. I want to use the latest version of Python, the latest notebook kernel, and I want to be able to use R and Python to be able to crunch this data and run my machine learning models. However, on the other side of this dichotomy is the IT organization, which wants to make sure that all tools are compliant, that your clusters are secure, and that your data is not going into places that are not secured by state-of-the-art security solutions, like Kerberos for example, right? And of course if the data scientists are putting the data on their laptops and taking the laptop around to wherever they go, that's not really a solution. So, that was one problem. And the other one was, if you were to bring them all together in the same solution, data scientists have different requirements. One may want to use Python 2.6. Another one may want to use 3.2, right? And so Cloudera Data Science Workbench is a new product that allows data scientists to visualize and do machine learning through this very nice notebook-like interface, share their work with the rest of their colleagues in the organization, but also allows you to keep your clusters secure. So it allows you to run against a Kerberized cluster, allows single sign-on to your web interface to Data Science Workbench, and provides a really nice developer experience in the sense that my workflow and my tools and my version of Python don't conflict with Jennifer's version of Python.
We all have our own Docker and Kubernetes-based infrastructure that makes sure that we use the packages that we need, and they don't interfere with each other. >> We're going to go to Jennifer on Altus in just a few minutes, but George, first I'll give you a chance to maybe dig in on Data Science Workbench. >> Two questions on the data science side: some of the toughest nuts to crack have been sort of a common environment for the collaborators, but also the ability to operationalize the models once you've sort of agreed on them, and manage the lifecycle across teams, you know? Like champion/challenger, promote something, or even before that doing the A/B testing, and then sort of, what's in production is typically in a different language from what, you know, it was designed in, and sort of integrating it with the apps. Where is that on the roadmap? 'Cause no one really has a good answer for that. >> Yeah, that's an excellent question. In general I think it's the problem to crack these days. How do you productionize something that was written by a data scientist in a notebook-like system onto the production cluster, right? And for the part where the data scientist works in a different language than the language that's in production, the best I can say right now is to actually have someone rewrite that. Have someone rewrite that in the language you're going to use in production, right? I don't see that as the more common part. I think the more widespread problem is, even when the language is the same in production, how do you go about making the part that the data scientist wrote — the model or whatever that would be — work on a production cluster? And so, Data Science Workbench in particular runs on the same cluster that is being managed by Cloudera Manager, right? So this is a tool that you install, but that is available to you as a web server, as a web interface, and so that allows you to move your development machine learning algorithms from your Data Science Workbench to production much more easily, because it's all running on the same hardware and same systems. There's no separate Cloudera Manager that you have to use to manage the workbench compared to your actual cluster. >> Okay. A tangential question, but one of the difficulties of doing machine learning is finding all the training data and sort of the data science expertise to sit with the domain expert to, you know, figure out the proper model and features, things like that. One of the things we've seen so far from the cloud vendors is they take their huge datasets in terms of voice, you know, images. They do the natural language understanding, speech to text, you know, facial recognition, 'cause they have such huge datasets they can train on. We're hearing noises that they're going to take that down to the more mundane statistical kind of machine learning algorithms, so that you wouldn't be, like, here's an algorithm to do churn, you know, go to town, but that they might have something that's already kind of pre-populated that you would just customize. Is that something that you guys would tackle, too? >> I can't speak for the roadmap in that sense, but I think some of that problem needs to be tackled by projects like Spark, for example. So I think as the stack matures, it's going to raise the level of abstraction as time goes on. And I think whatever benefits the Spark ecosystem will have will come directly to distributions like Cloudera. >> George: That's interesting.
>> Yeah. >> Okay. >> Alright, well let's go to Jennifer now and talk about Altus a little bit. Now you've been on the Cube show before, right? >> I have not. >> Okay, well, we're familiar with your work. Tell us again, you're the product manager for Altus. What does it do, and what was the motivation to build it? >> Yeah, we're really excited about Cloudera Altus. So, we released Cloudera Altus in its first GA form in April, and we launched Cloudera Altus in a public environment at Strata London about two weeks ago, so we're really excited about this and we are very excited to now open this up to all of the customer base. And what it is is a platform-as-a-service offering designed to leverage, basically, the agility and the scale of cloud, and make a very easy-to-use type of experience to expose Cloudera capacity, in particular for data engineering types of workloads. So the end user will be able to very easily, in a very agile manner, get data engineering capacity on Cloudera in the cloud, and they'll be able to do things like ETL and large-scale data processing, and productionized machine learning workflows, in the cloud with this new data-engineering-as-a-service experience. And we wanted to abstract away the cloud and cluster operations, and make the end user experience really easy. So, jobs and workloads are first-class objects. You can do things like submit jobs, clone jobs, terminate jobs, troubleshoot jobs. We wanted to make this very, very easy for the data engineering end user. >> It does sound like you've sort of abstracted away a lot of the infrastructure that you would associate with on-prem, and sort of almost made it, like, programmable and invisible. But, um, I guess one of my questions is, when you put it in a cloud environment — when you're on-prem you have a certain set of competitors, which is kind of restrictive, because you are the standalone platform. But when you go on the cloud, someone might say, "I want to use Redshift on Amazon," or Snowflake, you know, as the MPP SQL database at the end of a pipeline. And I'm just using those as examples — there's, you know, dozens, hundreds, thousands of other services to choose from. >> Yes. >> What happens to the integrity of that platform if someone carves off one piece? >> Right. So, interoperability and a unified data pipeline is very important to us, so we want to make sure that we can still service the entire data pipeline all the way from ingest and data processing to analytics. So our team has 24 different open source components that we deliver in the CDH distribution, and we have committers across the entire stack. We know the application, and we want to make sure that everything's interoperable, no matter how you deploy the cluster. So if you deploy data engineering clusters through Cloudera Altus, but you deployed Impala clusters for data marts in the cloud through Cloudera Director or through any other format, we want all these clusters to be interoperable, and we've taken great pains in order to make everything work together well. >> George: Okay. So how do Altus and Data Science Workbench interoperate with Spark? Maybe start with
>> Sure, so, in terms of interoperability we focus on things like making sure there are no data silos, so that the data in your entire data lake can be consumed by the different components in our system, the different compute engines and different tools. And so if you're processing data, you can also look at this data and visualize this data through Data Science Workbench. So after you do data ingestion and data processing, you can use any of the other analytic tools, and this includes Data Science Workbench. >> Right, and Data Science Workbench runs, for example, with the latest version of Spark you could pick — currently the latest released version of Spark, Spark 2.1; Spark 2.2 is on its way of course, and that will soon be integrated after its release. For example, you could use Data Science Workbench with your flavor of Spark 2.x, and you can run PySpark or Scala jobs on this notebook-like interface and be able to share your work. And because you're using Spark, underneath the hood it uses YARN for resource management; the Data Science Workbench itself uses Docker for configuration management, and Kubernetes for resource-managing these Docker containers. >> What would be, if you had to describe sort of the edge conditions and the sweet spot of the application — I mean, you talked about data engineering. One thing we were talking to Matei Zaharia and Reynold Xin about, and Ali Ghodsi as well, was if you put Spark on a database, or at least a, you know, sophisticated storage manager, like Kudu, all of a sudden there's a whole new class of jobs or applications that open up. Have you guys thought about what that might look like in the future, and what new applications you would tackle? >> I think a lot of that benefit, for example, could be coming from the underlying storage engine. So let's take Spark on Kudu, for example. The inherent characteristics of Kudu today allow you to do updates without having to either deal with the complexity of something like HBase, or the crappy performance of dealing with HDFS compactions, right? So the sweet spot comes from Kudu's capabilities. Of course it doesn't support transactions or anything like that today, but imagine putting something like Spark on it and being able to use the machine learning libraries. We have been limited so far in the machine learning algorithms that we have implemented in Spark by the storage system sometimes, and, for example, new machine learning algorithms, or the existing ones, could be rewritten to make use of the update features in Kudu. >> And so it sounds like the machine learning pipeline might get richer, but I'm not hearing that — and maybe this isn't sort of in the near-term roadmap — the idea that you would build sort of operational apps that have these sophisticated analytics built in, you know, where you've done the training but at run time, you know, the inferencing influences a transaction, influences a decision. Is that something that you would foresee? >> I think that's totally possible. Again, at the core of it is the part that now you have one storage system that can do scans really well, and it can also do random reads and writes any place, right?
And so that allows applications which were previously siloed — because one application ran off of HDFS and another application ran off of HBase, and so you had to correlate them — to just be one single application that you can use to train, and then also use the trained model to make decisions on the new transactions that come in. >> So that's very much within the sort of scope of imagination, or scope. That's part of sort of the ultimate plan? >> Mark: I think it's definitely conceivable now, yeah. >> Okay. >> We're up against a hard break coming up in just a minute, so you each get a 30-second answer here, and it's the same question. You've been here for a day and a half now. What's the most surprising thing you've learned that you think should be shared more broadly with the Spark community? Let's start with you. >> I think one of the great things that's happening in Spark today is, people have been complaining about latency for a long time. So if you saw the keynote yesterday, you would see that Spark is making forays into reducing that latency. And if you are interested in Spark, using Spark, it's very exciting news. You should keep tabs on it. We hope to deliver lower latency as a community sooner. >> How long is one millisecond? (Mark laughs) >> Yeah, I'm largely focused on cloud infrastructure, and I found here at the conference that, like, many, many people are very much prepared to actually start taking on more, you know, more POCs and more interest in cloud, and the response in terms of all of this and Altus has been very encouraging. >> Great. Well, Jennifer, Mark, thank you so much for spending some time here on the Cube with us today. We're going to come by your booth and chat a little bit more later. It's some interesting stuff. And thank you all for watching the Cube today here at Spark Summit 2017, and thanks to Cloudera for bringing us these two experts. And thank you for watching. We'll see you again in just a few minutes with our next interview.
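For readers who want a picture of the kind of data engineering job Jennifer describes submitting to a service like Altus, here is a minimal PySpark sketch: read raw records, clean them, aggregate, and write a columnar output for downstream analytics. The column names, sample rows, and output path are illustrative assumptions — the job itself is just ordinary Spark code, not an Altus-specific API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# In a real job this would come from an object store, e.g.
# spark.read.csv("s3a://bucket/raw/orders/", header=True);
# inline rows keep the sketch self-contained.
orders = spark.createDataFrame(
    [
        ("o1", "c1", "2017-06-05", "19.99"),
        ("o2", "c2", "2017-06-05", "5.49"),
        ("o3", "c1", "2017-06-06", None),      # malformed row, will be dropped
        ("o4", "c3", "2017-06-06", "102.00"),
    ],
    ["order_id", "customer_id", "order_date", "amount"],
)

# Basic cleaning: drop malformed rows and normalize types.
cleaned = (
    orders
    .dropna(subset=["order_id", "customer_id", "amount"])
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_date"))
)

# Aggregate to a daily revenue table.
daily_revenue = (
    cleaned
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("customer_id").alias("customers"))
)

# Write a columnar output for downstream analytic tools.
daily_revenue.write.mode("overwrite").parquet("/tmp/curated/daily_revenue")
```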
Reynold Xin, Databricks - #Spark Summit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> Welcome back, we're here at theCUBE at Spark Summit 2017. I'm David Goad here with George Gilbert, George. >> Good to be here. >> Thanks for hanging with us. Well here's the other man of the hour here. We just talked with Ali, the CEO at Databricks, and now we have the Chief Architect and co-founder at Databricks, Reynold Xin. Reynold, how are you? >> I'm good. How are you doing? >> David: Awesome. Enjoying yourself here at the show? >> Absolutely, it's fantastic. It's the largest Summit. There are a lot of interesting things, a lot of interesting people who I get to meet. >> Well I know you're a really humble guy, but I had to ask Ali what I should ask Reynold when he gets up here. Reynold is one of the biggest contributors to Spark. And you've been with us for a long time, right? >> Yes, I've been contributing to Spark for about five or six years — that's probably the most number of commits to the project — and lately I'm working more with other people to help design the roadmap for both Spark and Databricks with them. >> Well let's get started talking about some of the new developments that maybe our audience at theCUBE hasn't heard here in the keynote this morning. What are some of the most exciting new developments? >> So, I think in general if we look at Spark, there are three directions I would say we're doubling down on. The first direction is deep learning. Deep learning is extremely hot and it's very capable, but as we alluded to earlier in a blog post, deep learning has reached sort of a mass-produced point in which it shows tremendous potential but the tools are very difficult to use. And we are hoping to democratize deep learning and do for deep learning what Spark did for big data, with this new library called Deep Learning Pipelines. What it does is it integrates different deep learning libraries directly in Spark and can actually expose models in SQL. So even the business analysts are capable of leveraging that. So, that's one area, deep learning. The second area is streaming. Streaming, again, I think a lot of customers have aspirations to actually shorten the latency and increase the throughput in streaming. So, the structured streaming effort is going to be generally available, and last month alone on the Databricks platform, I think our customers processed three trillion records using structured streaming. And we also have a new effort to actually push down the latency all the way to the millisecond range. So, you can really do blazingly fast streaming analytics. And last but not least is the SQL data warehousing area. Data warehousing, I think, is a very mature area outside of the big data point of view, but from a big data one it's still pretty new, and there are a lot of use cases popping up there. And in Spark, with approaches like the CBO, and also the impact here in the Databricks Runtime with DBIO, we're actually substantially improving the performance and the capabilities of data warehousing features. >> We're going to dig in to some of those technologies here in just a second with George. But have you heard anything here so far from anyone that's changed your mind maybe about what to focus on next? >> So, one thing I've heard from a few customers is actually visibility and debuggability of the big data jobs.
So many of them are fairly technical engineers, and some of them are less sophisticated engineers, and they have written jobs and sometimes the job runs slow. And so the performance engineer in me would think, so how do I make the job run fast? A different way to actually solve that problem is, how can we expose the right information so the customer can actually understand and figure it out themselves: this is why my job is slow and this is how I can tweak it to make it faster. Rather than giving people the fish, you actually give them the tools to fish. >> If you can call that bugability. >> Reynold: Yeah, debuggability. >> Debuggability. >> Reynold: And visibility, yeah. >> Alright, awesome, George. >> So, let's go back and unpack some of those kind of juicy areas that you identified. On deep learning, you were able to distribute, if I understand things right, the predictions. You could put models out on a cluster, but the really hard part, the compute-intensive stuff, was training across a cluster. And so Deeplearning4j and I think Intel's BigDL, they were written for Spark to do that. But with all the excitement over some of the new frameworks, are they now at the point where they are as good citizens on Spark as they are on their native environments? >> Yeah so, this is a very interesting question. Obviously a lot of other frameworks are becoming more and more popular, such as TensorFlow, MXNet, Theano, Keras, and others. What the Deep Learning Pipelines library does is it actually exposes all these single-node deep learning tools, highly optimized for say even GPUs or CPUs, to be available as an estimator, or like a module in a pipeline of the machine learning pipeline library in Spark. So, now users can actually leverage Spark's capability to, for example, do hyperparameter tuning. So, when you're building a machine learning model, it's fairly rare that you just run something once and you're good with it. Usually you have to fiddle with a lot of the parameters. For example, you might run over a hundred experiments to actually figure out what is the best model I can get. This is where actually Spark really shines. When you combine Spark with some deep learning library, be it BigDL or be it MXNet, be it TensorFlow, you could be using Spark to distribute that training and then do cross validation on it. So you can actually find the best model very quickly. And Spark takes care of all the job scheduling, all the fault tolerance properties, and how do you read data in from different data sources.
Another thing, maybe, just to add — >> Yeah, yeah. >> Another really cool functionality of the deep learning pipeline is transfer learning. So as you said, deep learning takes a very long time, it's very computationally demanding, and it takes a lot of resources and expertise to train. But with transfer learning, what we allow the customers to do is they can take an existing deep learning model, already well trained in a different domain, and retrain it on a very small amount of data very quickly, and they can adapt it to a different domain. That's sort of how the demo on the James Bond car worked. So there is a general image classifier that we trained on probably just a few thousand images, and now we can actually detect whether a car is James Bond's car or not. >> Oh, and the implications there are huge, which is you don't have to have huge training data sets for modifying a model of a similar situation. I want to, in the time we have — there's always been this debate about whether Spark should manage state, whether it's a database, key-value store. Tell us how the thinking about that has evolved, and then how the integration interfaces for achieving that have evolved. >> One of the, I would say, advantages of Spark is that it's unbiased and works with a variety of storage systems, be it Cassandra, be it HBase, be it HDFS, be it S3. There is a metadata management functionality in Spark, which is the catalog of tables that customers can define, but the actual storage sits somewhere else. And I don't think that will change in the near future, because we do see that the storage systems have matured significantly in the last few years, and I just wrote a blog post last week about the advantage of S3 over HDFS, for example. The storage price is being driven down by almost a factor of 10X when you go to the cloud. I just don't think it makes sense at this point to be building storage systems for analytics. That said, I think there's a lot of building on top of existing storage systems. There's actually a lot of opportunity for optimization in how you can leverage the specific properties of the underlying storage system to get to maximum performance. For example, how are you doing intelligent caching, how do you start thinking about actually building indexes against the data that's stored for scan workloads. >> With Tungsten, you take advantage of the latest hardware, and we're getting more memory-intensive systems, and now the Catalyst optimizer has a cost-based optimizer, or will, and large memory. Can you change how you go about knowing what data you're managing in the underlying system and therefore achieve a tremendous acceleration in performance? >> This is actually one area we invested in with the DBIO module as part of Databricks Runtime, and what DBIO does — a lot of this is still in progress — but for example, we're adding some form of indexing capability to the system so we can quickly skip and prune out all the irrelevant data when the user is doing simple point look-ups, or if the user is doing a scan-heavy workload with some predicates. That actually has to do with how we think about the underlying data structure. The storage system is still the same storage system, like S3, but we're actually adding indexing functionalities on top of it as part of DBIO. >> And so what would be the application profiles? Is it just for the analytic queries, or can you do the point look-ups and updates in that sort of scenario too? >> So it's interesting you're talking about updates.
Updates are another thing that we've got a lot of feature requests on. We're actively thinking about how we will support update workloads. Now, that said, I just want to emphasize, for both use cases of doing point look-ups and updates, we're still talking about the context of an analytic environment. So we would be talking about, for example, maybe bulk updates or low-throughput updates, rather than doing transactional updates in which every time you swipe a credit card some record gets updated. That probably belongs more on the transactional databases, like Oracle or MySQL even. >> What about when you think about people who are going to run — they started out with Spark on prem, they realize they're going to put much more of their resources in the cloud, but with IIoT, industrial IoT type applications, they're going to have Spark maybe in a gateway server on the edge? What do you think that configuration looks like? >> Really interesting, it's kind of two questions maybe. The first is the hybrid on-prem, cloud solution. Again, one of the nice advantages of Spark is the decoupling of storage and compute. So when you want to move, for example, workloads from on prem to the cloud, the thing you care the most about is probably actually the data, 'cause the compute, it doesn't really matter that much where you run it, but data's the one that's hard to move. We do have customers that are leveraging Databricks in the cloud but actually reading data directly from on prem, relying on the caching solution we have to minimize the data transfer over time. And that is one route, I would say it's pretty popular. Another one is, with Amazon you can literally just give them a Snowball. You give them hard drives, with trucks — the trucks will ship your data directly and put it in S3. With IoT, a common pattern we see is a lot of the edge devices would be actually pushing the data directly into some firehose like Kinesis or Kafka — I'm sure Google and Microsoft both have their own variants of that — and then you use Spark to directly subscribe to those topics and process them in real time with structured streaming. >> And so would Spark be down, let's say, at the site level, if it's not on the device itself? >> It's an interesting thought, and maybe one thing we should actually consider more in the future is how do we push Spark to the edges. Right now it's more of a centralized model, in which the devices push data into Spark, which is centralized somewhere. I've seen, for example — I don't remember the exact use case, but it has to do with some scientific experiment in the North Pole. And of course there you don't have a great uplink for all the data to transfer back to some national lab, and rather they would do smart processing there and then ship the aggregated result back. There's another one but it's less common. >> Alright, well just one minute now before the break, so I'm going to give you a chance to address the Spark community. What's the next big technical challenge you hope people will work on for the benefit of everybody? >> In general Spark came along with two focuses. One is performance, the other one's ease of use. And I still think big data tools are too difficult to use. Deep learning tools, even harder. The barrier to entry is very high for all of these tools. I would say we might have already addressed performance to a degree that I think it's actually pretty usable. The systems are fast enough. Now, we should work on actually making them even easier to use.
That's also what we focus a lot on at Databricks here. >> David: Democratizing access, right? >> Absolutely. >> Alright, well Reynold, I wish we could talk to you all day. This is great. We are out of time now. We appreciate you coming by theCUBE and sharing your insights, and good luck with the rest of the show. >> Thank you very much David and George. >> Thank you all for watching, here we are at theCUBE at Spark Summit 2017. Stay tuned, lots of other great guests coming up today. We'll see you in a few minutes.
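For a concrete picture of the structured streaming pattern Reynold describes — devices pushing events into a firehose and Spark subscribing and aggregating them continuously — here is a minimal PySpark sketch. It uses the built-in rate source in place of Kafka or Kinesis so it runs without an external broker; the window size, watermark, and console sink are illustrative choices on my part, not a Databricks-specific configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# In the IoT pattern described above, the source would instead be something like:
#   spark.readStream.format("kafka")
#        .option("kafka.bootstrap.servers", "...")
#        .option("subscribe", "device-events").load()
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Count events per 10-second window, updating results as new data arrives.
counts = (
    events
    .withWatermark("timestamp", "30 seconds")
    .groupBy(F.window("timestamp", "10 seconds"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
)

query.awaitTermination(30)  # let the sketch run for ~30 seconds
query.stop()
```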
Eric Siegel, Predictive Analytics World - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE. You are watching coverage of Spark Summit 2017. It's day two, we've got so many new guests to talk to today. We already learned a lot, right George? >> Yeah, I mean we had some, I guess, pretty high bandwidth conversations. >> Yes, well I expect we're going to have another one here too, because the person we have is the founder of Predictive Analytics World, it's Eric Siegel. Eric, welcome to the show. >> Hey thanks Dave, thanks George. You go by Dave or David? >> Dave: Oh you can call me sir, and that would be. >> I was calling you, should I, can I bow? >> Oh no, we are bowing to you, you're the author of the book Predictive Analytics; I love the subtitle, The Power to Predict Who Will Click, Buy, Lie or Die. >> And that sums up the industry, right? >> Right, so if people are new to the industry, that's sort of an informal definition of predictive analytics, basically also known as machine learning, where you're trying to make predictions for each individual, whether it's a customer for marketing, a suspect for fraud or law enforcement, a voter for political campaigning, a patient for healthcare. So, in general it's on that level, it's a prediction for each individual. So how does data help make those predictions? And then you can only imagine just how many ways in which predicting on that level helps organizations improve all their activities. >> Well we know you were on the keynote stage this morning. Could you maybe summarize for theCUBE audience a couple of the top themes that you were talking about? >> Yeah, I covered two advanced topics that I wanted to make sure this pretty technical audience was aware of, because a lot of people aren't. One is called uplift modeling, so that's optimizing for persuasion, for things like marketing and also for healthcare, actually, and for political campaigning. So when you do predictive analytics for targeting marketing, normally, sort of the traditional approach is, let's predict: will this person buy if I contact them? Because if so, well, it's okay, maybe it's a good idea to spend the two dollars to send them a brochure, the marketing treatment, right? But there is actually a little bit different question that would drive even better decisions, which is not, will this person buy, but would contacting them, sending them the brochure, influence them to buy — will it increase the chance that we get that positive outcome? That's a different question, and it doesn't correspond with standard predictive modeling or machine learning methods. So uplift modeling — also known as net lift modeling or persuasion modeling — is a way to actually create a predictive model like any other, except that its target is: is it a good idea to contact this person, because it will increase the chances that they are going to have a positive outcome. So that's the first of the two. And I crammed this all into 20 minutes. The other one was a little more commonly known, but I think people would like to revisit it, and it's called p-hacking, or vast search, where you can be fooled by randomness in data relatively easily. In the era of big data there is this all-too-common pitfall where you find a predictive insight in the data and it turns out it was actually just a random perturbation. How do you know the difference? >> Dave: Fake news, right? >> Okay, fake news, except that in this case it was generated by a computer, right?
And then there is a statistical test that makes it look like it's actually statistically significant, so that we would give credibility to it. So you can avert it: you have to compensate for the fact that you are trying lots, that you are evaluating many different predictive insights or hypotheses, whatever you want to call them, and make sure that for the one you end up believing, you've checked for the possibility that it wasn't just random luck. That's known as p-hacking. >> Alright, so uplift modeling and p-hacking. George, do you want to drill in on those a little bit? >> Yeah, I want to start from maybe the vocabulary of our audience, where they say, sort of, uplift modeling goes beyond prediction. Actually, even for the second one with p-hacking, is that where you're essentially playing with the parameters of the model to find the difference between correlation and causation, and going from prediction to prescription? >> It's not about causation, actually. So correlation is what you get when you get a predictive insight, or some component of a predictive model, where you see these things connected, therefore one is predictive of the other. Now, the fact that that does not entail causation is a really good point to remind people of, as such. But even before you address that question, the first question is: is this correlation actually legit? Is there really a correlation between these things? Is this an actual finding? Or did it just happen to be the case in this particular limited sample of data that I have access to at the moment, right? So is it a real link or correlation in the first place, before you even start asking any question about causality. And it does relate to what you alluded to with regard to tuning parameters, because it's closely related to this issue of overfitting. People who do predictive modeling are very familiar with overfitting. The standard practice — all tools and implementations of machine learning and predictive modeling do this — is that they hold aside an evaluation set called the test set. So you don't get to cheat. It creates a predictive model, it learns from the data, does the number crunching, it's mostly automated, right, and it comes out with this beautiful model that does well predicting, and then you evaluate it, you assess it over this held-aside set. Oh, my thing's falling off here. >> Dave: Just a second on your. >> So then you evaluate it on this held-aside set — it was quarantined, so you didn't get to cheat. You didn't get to look at it when you were creating the model. So it serves as an objective performance measure. The problem is, and here is the huge irony: with the things that we get from data, the predictive insights, there was one famous one that was broadcast too loudly, because it's not nearly as credible as they first thought, which is that an orange used car is a better one to buy because it's less likely to be a lemon. That's what it looked like in this one data set. The problem is that when you have a single insight, it's relatively simple — it's just talking about the car, using the color to make the prediction. A predictive model is much more complex and deals with lots of other attributes, not just the color: for example, make, year, model, everything on that individual car or individual person — you can imagine all the attributes. That's the point of the modeling process, the learning process: how do you consider multiple things.
If its just a really simple thing with just based on the car color, then many of even the most advanced data science practitioners kind of forget that there is still potential to effectively overfit, that you might have found something that doesn't apply in general, only applies over this particular set of data. So that's where the trap falls and they don't necessarily hold themselves a high standard of having this held aside test set. So its kind of ironic thing, the things that most likely to make the headlines like orange cars are simpler, easier to understand, but are less well understood that they could be wrong. >> You know keying off that, that's really interesting, because we've been hearing for years that what's made, especially deep learning relevant over the last few years is huge compute up in the cloud and huge data sets. >> Yeah. >> But we're also starting to hear about methods of generating a sort of synthetic data so that if you don't have, I don't know what the term is, organic training data, and then test data, we're getting to the point where we can do high quality models with less. >> Yes, less of that training data. And did you. >> Tell us. >> Did you interview with the keynote speaker from Stanford about that? >> No, I only saw part of his. >> Yeah his speech yesterday. That's an area that I'm relatively new to but it sounds extremely important because that is the bottleneck. He called it, if data's the new oil, he's calling it the new-new oil. Which is more specific than data, it's training data. So all of the machine learning or predictive modeling methods of which we speak, are, in most cases, what's called supervised learning. So the thing that makes it supervised is you have a bunch of examples where you already know the answer. So you're trying to figure out is this picture of a cat or of a dog, that means you need to have a whole bunch of data from which to learn, the training data, where you've already got it labeled. You already know the correct answer. In many business applications just because of history you know who did or didn't respond to your marketing, you know who did or did not turn out to be fraudulent. History is experience in which to learn, it's in the data, so you do have that labeled, yes, no, like you already know the answer, you don't need to predict on them, it's in the past but you use that as training data. So we have that in many cases. But for something like classifying an image, and we're trying to figure out does this have a picture of a cat somewhere in the image, or whatever all these big image classification problems, you do need, often, a manual effort to label the data. Have the positive and negative examples, that's what's called training data, the learning data. It's actually called training data. There's definitely a bottleneck so anything that can be done to avert that bottleneck decrease the amount that we need, or find ways to make, sort of, rough training data that may serve as a building block for the modeling process this kind of thing. That's not my area of expertise, sounds really intriguing though. >> What about, and this may be further out on the horizon but one thing we are hearing about is the extreme shortage of data scientists who need to be teamed up with domain experts to figure out the knobs, the variables to create these elaborate models. 
We're told that even if you're doing the traditional, statistical machine learning models, eventually deep learning can help us identify the features or the variables, just the way it sort of identifies, you know, ears and whiskers and a nose and then figures out from that the cat. Is that something that's in the near term, the medium term, in terms of helping to augment what the data scientist does? >> It's in the near term, and that's why everyone's excited about deep learning right now. Basically, the reason we built these machines called computers is because they automate stuff. Pretty much anything that you can think of and define well, you can program. Then you've got a machine that does it. Of course one of the things we want them to do, actually, is to learn from data. Now, it's literally really very analogous to what it means for a human to learn. You've got a limited number of examples, and you're trying to draw generalizations from those. Then you go to bigger-scale problems, where the thing you're classifying isn't just, like, a customer and all the things you know about the customer — are they likely to commit fraud, yes or no. It becomes a level more complex when it's an image, right — an image is worth a thousand words, and maybe literally more than a thousand words' worth of data if it's high resolution. So how do you process that? Well, there's all sorts of research like, well, we can define the thing that tries to find arcs and circles and edges and this kind of thing, or we can try to, once again, let that be automatic. Let the computer do that. So deep learning is a way to allow that; Spark is a way to make it operate quickly, but there's another level of scale other than speed. The level of scale is just, how complex of a task can you leave up to the automaton to go do by itself. That's what deep learning does — it scales in that respect; it has the ability to automate more layers of that complexity, as far as finding those kinds of what might be domain-specific features in images.
If you've got an image and you need to know is there a picture of a car, or is this traffic light green or red, somewhere in this image, then there's certain application areas, self driving cars what have you, it does need to be accurate right. But maybe there's more potential for it to be accurate because there's more predictability inherent to that problem. Like I can predict that there's a traffic light that has a green light somewhere in an image because there is enough label data and the nature of the problem is more tractable because it's not as challenging to find where the traffic light is, and then which color it is. You need it to scale, to reach that level of classification performance in terms of accuracy or whatever measure you use for certain applications. >> Are you seeing like new methodologies like reinforcement learning or deep learning where the models are adversarial where they make big advances in terms of what they can learn without a lot of supervision? Like the ones where. >> It's more self learning and unsupervised. >> Sort of glue yourself onto this video game screen we'll give you control of the steering wheel and you figure out how to win. >> Having less required supervision, more self-learning, anomaly detection or clustering, these are some of the unsupervised ones. When it comes to vision there are part of the process that can be unsupervised in the sense that you don't need labels on your target like is there a car in the picture. But it can still learn the feature detection in a way that doesn't have that supervised data. Although that image classification in general, on that level deep learning, is not my area of expertise. That's a very up and coming part of machine learning but it's only needed when you have these high bandwidth inputs like an entire image, high resolution, or a video, or a high bandwidth audio. So it's signal processing type problems where you start to need that kind of deep learning. >> Great discussion Eric, just a couple of minutes to go in this segment here. I want to make sure I give a chance to talk about Predictive Analytics World and what's your affiliation with that ad what do you want theCUBE audience to know? >> Oh sure, Predictive Analytics World I'm the founder it's the leading cross-vendor event focused on commercial deployment of predictive analytics and machine learning. Our main event a few times a year is a broad scope business focused event but we also have industry vertical focused specialized events just for financial services, healthcare, workforce, manufacturing and government applications of predictive analytics and machine learning. So there's a number a year, and two weeks from now in Chicago, October in New York and you can see the full agendas at PredictiveAnalyticsWorld.com. >> Alright great short commercial there. 30 seconds. >> It's the elevator pitch. >> Answered the toughest question in 30 seconds what the toughest question you got after your keynote this morning? Maybe a hallway conversation or. >> What's the toughest question I got after my keynote? >> Dave: From one of the attendees. >> Oh, the question that always comes up is how do you get this level of complexity across to non-technical people or your boss or your colleagues or your friends and family. By the way that's something I worked really hard on with the book which is meant for all readers although the last few chapters have. >> How do you get executive sponsors to get what you're doing? >> Well, as I say, give them the book. 
Because the point of the book is it's pop science: it's accessible, it's analytically driven, it's entertaining, it keeps it relevant, but it does address advanced topics at the end of the book. So it sort of ends with an industry overview kind of thing. The bottom line there, in general, is that you want to focus on the business impact. What I mentioned briefly a second ago: if we can improve target marketing this much, it will increase profit by a factor of five, something like that. So you start with that and then answer any questions they have about, well, how does it work, what makes it credible that it really has that much potential in the bottom line. When you're a techie, you're inclined to go forward, you start with the technology that you're excited about. That's my background, so that's sort of the definition of being a geek, that you're more enamored with the technology than the value it produces. Because it's amazing that it works, and it's exciting, it's interesting, it's scientifically challenging. But when you're talking to the decision makers, you have to start with the eventual carrot at the end of the stick, which is the value. >> The business outcome. >> Yeah. >> Great, well that's going to be the last word. That might even make it onto our CUBE Gems segment, great sound bites. George thanks again, great questions, and Eric, the author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die, thank you for being on the show, we appreciate your time. >> Eric: Sure, yeah thank you, great to meet you. >> Thank you for watching theCUBE, we'll be back in just a few minutes with our next guest here at Spark Summit 2017.
Ash Munshi, Pepperdata - #SparkSummit - #theCUBE
(upbeat music) >> Announcer: Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, it's day two at the Spark Summit 2017. I'm David Goad, and I'm here with George Gilbert from Wikibon, George. >> George: Good to be here. >> Alright, and the guest of honor of course, is Ash Munshi, who is the CEO of Pepperdata. Ash, welcome to the show. >> Thank you very much, thank you. >> Well you have an interesting background, I want you to just tell us real quick here, not give the whole bio, but you got a great background in machine learning, you were an early user of Spark, tell us a little bit about your experience. >> So I'm actually a mathematician originally, a theoretician who worked for IBM Research, and then subsequently Larry Ellison at Oracle, and a number of other places. But most recently I was CTO at Yahoo, and then subsequent to that I did a bunch of startups, that involved different types of machine learning, and also just in general, sort of a lot of big data infrastructure stuff. >> And go back to 2012 with Spark right? You had an interesting development. >> Right, so 2011, 2012, when Spark was still early, we were actually building a recommendation system, based on user-generated reviews. That was a project that was done with Nando de Freitas, who is now at DeepMind, and Peter Cnudde, who's one of the key guys that runs infrastructure at Yahoo. We started that company, and we were one of the early users of Spark, and what we found was, that we were analyzing all the reviews at Amazon. So Amazon allows you to crawl all of their reviews, and we basically had natural language processing, that would allow us to analyze all those reviews. When we were doing sort of MapReduce stuff, it was taking us a huge number of nodes, and 24 hours to actually go do analysis. And then we had this little project called Spark, out of AMPlab, and we decided to spin it up, and see what we could do. It had lots of issues at that time, but we were able to actually spin it up on to, I think it was in the order of 100,000 nodes, and we were able to take our times for running our algorithms from, you know, sort of tens of hours, down to sort of an hour or two, so it was a significant improvement in performance. And that's when we realized that, you know, this is going to be something that's going to be really important once that set of issues was worked through, once it got mature enough, and I'm glad to see that it's actually happened now, and it's actually taken over the world. >> Yeah that little project became a big deal, didn't it? >> It became a big deal, and now everybody's taking advantage of the same thing. >> Well bring us to the present here. We'll talk about Pepperdata and what you do, and then George is going to ask a little bit more about some of the solutions that you have. >> Perfect, so Pepperdata was a company founded by two gentlemen, Sean Suchter and Chad Carson. Sean used to run Yahoo Search, and was one of the first guys who actually helped develop Hadoop next to Eric14 and that team. And then Chad was one of the first guys who actually figured out how to monetize clicks, and was the data science guy around the whole thing. So those are the two guys that actually started the company. I joined the company last July as CEO, and you know, what we've done recently, is we've sort of expanded our focus of the company to addressing DevOps for big data.
And the reason why DevOps for big data is important, is because what's happened in the last few years, is people have gone from experimenting with big data, to taking big data into production, and now they're actually starting to figure out how to actually make it so that it actually runs properly, and scales, and does all the other kinds of things that are there, right? So, it's that transition that's actually happened, so, "Hey, we ran it in production, "and it didn't quite work the way we wanted to, "now we actually have to make it work correctly." That's where we sort of fit in, and that's where DevOps comes in, right? DevOps comes in when you're actually trying to make production systems that are going to perform in the right way. And the reason for DevOps is it shortens the cycle between developers and operators, right? So the tighter the loop, the faster you can get solutions out, because business users are actually wanting that to happen. That's where we're squarely focused, is how do we make that work? How do we make that work correctly for big data? And the difference between sort of classic DevOps and DevOps for big data, is that you're now dealing with not just, you know, a set of computers solving an isolated sort of problem. You're dealing with thousands of machines that are solving one problem, and the amount of data is significantly larger. So the classical methodologies that you have, while, you know, agile and all that still works, the tools don't work to actually figure out what you can do with DevOps, and that's where we come in. We've got a set of tools that are focused on performance effectively, 'cause that's the big difference between distributed systems performance, I should say, and sort of classic, even scaled-out computing, right? So if you've got web servers, yes performance is important, and you need data for those, but that can actually be sharded nicely. This is one system working on one problem, right? Or a set of systems working on one problem. That's much harder, it's a different set of problems, and we help solve those problems. >> Yeah, and George you look like you're itching to dig into this, feel free. (exclaims loudly) >> Well so, one of the big announcements at the show, and sort of the headline announcement today, was Spark serverless, like so it's not just someone running Spark in the cloud sort of as a managed service, it's up there as a, you know, sort of SaaS application. And you could call it platform as a service, but it's basically a service where, you know, the infrastructure is invisible. Now, for all those customers who are running their own clusters, which is pretty much everyone I would imagine at this point, how far can you take them in hiding much of the overhead of running those clusters? And by the overhead I mean, you know, primarily the performance and maximizing, you know, sort of maximizing resource efficiency. >> So, you have to actually sort of double-click on the kind of resources that we're talking about here, right? So there's the number of nodes that you're going to need to actually do the computation. There is, you know, the amount of disk storage and stuff that you're going to need, what type of CPUs you're going to need. All of that stuff is sort of part of the costing, if you will, of running an infrastructure. If somebody hides all that stuff, and makes it so that it's economical, then you know, that's a great thing, right?
And if it can actually be made so that it works for huge installations, and hides it appropriately so I don't pay too much of a tax, that's a wonderful thing to do. But we have, our customers are enterprises, typically Fortune 200 enterprises, and they have both a mixture of cloud-based stuff, where they actually want to control everything about what's going on, and then they have infrastructure internally, which by definition they control everything that's going on, and for them we're very, very applicable. I don't know how we'd be applicable in this sort of new world as a service that grows and shrinks. I can certainly imagine that whoever provides that service would embed us, to be able to use the stuff more efficiently. >> No, you answered my question, which is, for the people who aren't getting the turnkey, you know, sort of SaaS solution, and they need help managing, you know, what's a fairly involved stack, they would turn to you? >> Ash: Yes. >> Okay. >> Can I ask you about the specific products? >> George: Oh yes. >> I saw you at the booth, and I saw you were announcing a couple of things. Well what is new-- >> Ash: Correct. >> With the show? >> Correct, so at the show we announced Code Analyzer for Apache Spark, and what that allows people to do, is really understand where performance issues are actually happening in their code. So, one of the wonderful things about Spark, compared to MapReduce, is that it abstracts the paradigm that you actually write against, right? So that's a wonderful thing, 'cause it makes it easier to write code. The problem when we abstract, is what does that abstraction do down in the hardware, and where am I losing performance? And being able to give that information back to the user. So you know, in Spark, you have jobs that can run in parallel. So an app consists of jobs, jobs can run in parallel, and each one of these things can consume resources, CPU, memory, and you see that through sort of garbage collection, or a disk or a network, and what you want to find out is, which one of these parallel tasks was dominating the CPU? Why was it dominating the CPU? Which one actually caused the garbage collector to actually go crazy at some point? While the Spark UI provides some of that information, what it doesn't do is give you a time series view of what's going on. So it's sort of a blow-by-blow view of what's going on. By imposing the time series view on sort of an enhanced version of the Spark UI, you now have much better visibility about which offending stages are causing the issue. And the nice thing about that is, once you know that, you know exactly which piece of code that you actually want to go and look at. So a classic example would be, you might have two stages that are running in parallel. The Spark UI will tell you that it's stage three that's causing the problem, but if you look at the time series, you'll find out that stage two actually runs longer, and that's the one that's pegging the CPU. And you can see that because we have the time series, but you couldn't see that any other way. >> So you have a code analyzer and also the app profiler. >> So the app profiler is the other product that we announced a few months ago. We announced that I guess about three months ago or so. And the app profiler, what it does, is, after the run is done, it actually looks at all the data that the run produces, that the Spark history server produces, and then it actually goes back and analyzes that and says, "Well you know what?
"You're executors here, are not working as efficiently, "these are the executors "that aren't working as efficiently." It might be using too much memory or whatever, and then it allows the developer to basically be able to click on it and say, "Explain to me why that's happening?" And then it gives you a little, you know, a little fix-it if you will. It's like, if this is happening, you probably want to do these things, in order to improve performance. So, what's happening with our customers, is our customers are asking developers to run the application profiler first, before they actually put stuff on production. Because if the application profiler comes back and says, "Everything is green." That there's no critical issues there. Then they're saying, "Okay fine, put it on my cluster, "on the production cluster, "but don't do it ahead of time." The application profiler, to be clear, is actually based on some work that, on open source project called Dr. Elephant, which comes out of LinkedIn. And now we're working very closely together to make sure that we actually can advance the set of heuristics that we have, that will allow developers to understand and diagnose more and more complex problems. >> The Spark community has the best code names ever. Dr. Elephant, I've never heard of that one before. (laughter) >> Well Dr. Elephant, actually, is not just the Spark community, it's actually also part of the MapReduce community, right? >> David: Ah, okay. >> So yeah, I mean remember Hadoop? >> David: Yes. >> The elephant thing, so Dr. Elephant, and you know. >> Well let's talk about where things are going next, George? >> So, you know, one of the things we hear all the time from customers and vendors, is, "How are we going to deal with this new era "of distributed computing?" You know, where we've got the cloud, on-prem, edge, and like so, for the first question, let's leave out the edge and say, you've got your Fortune 200 client, they have, you know, production clusters or even if it's just one on-prem, but they also want to work in the cloud, whether it's for elastics stuff, or just for, they're gathering a lot of data there. How can you help them manage both, you know, environments? >> Right, so I think there's a bunch of times still, before we get into most customers actually facing that problem. What we see today is, that a lot of the Fortune 200, or our customers, I shouldn't say a lot of the Fortune 200, a lot of our customers have significant, you know, deployments internally on-prem. They do experimentation on the cloud, right? The current infrastructure for managing all these, and sort of orchestrating all this stuff, is typically YARN. What we're seeing, is that more than likely they're going to wind up, or at least our intelligence tells us that it's going to wind up being Kubernetes that's actually going to wind up managing that. So, what will happen is-- >> George: Both on-prem and-- >> Well let me get to that, alright? >> George: Okay. >> So, I think YARN will be replaced certainly on-prem with Kupernetes, because then you can do multi data center, and things of that sort. The nice thing about Kupernetes, is it in fact can span the cloud as well. So, Kupernetes as an infrastructure, is certainly capable of being able to both handle a multi data center deployment on-prem, along with whatever actually happens on the cloud. There is infrastructure available to do that. 
It's very immature, most of the customers aren't anywhere close to being able to do that, and I would say even before Kubernetes gets accepted within the environment, it's probably 18 months, and there's probably another 18 months to two years, before we start facing this hybrid cloud, on-prem kind of problem. So we're a few years out I think. >> So, would, for those of us including our viewers, you know, who know the acronym, and know that it's a, you know, scheduler slash cluster manager, resource manager, would that give you enough of a control plane and knowledge of sort of the resources out there, for you to be able to either instrument or deploy an instrument to all the clusters (mumbles). >> So we are actually leading the effort right now for big data on Kubernetes. So there is a group of, there's a small group working. It's Google, us, Red Hat, Palantir, Bloomberg now has joined the group as well. We are actually today talking about our effort on getting HDFS working on Kubernetes, so we see the writing on the wall. We clearly are positioning ourselves to be a player in that particular space, so we think we'll be ready and able to take that challenge on. >> Ash this is great stuff, we've just got about a minute before the break, so I wanted to ask you just a final question. You've been in the Spark community for a while, so which of the open source tools should we be keeping our eyes out for? >> Kubernetes. >> David: That's the one? >> To me that is the killer that's coming next. >> David: Alright. >> I think that's going to make life, it's going to unify the microservices architecture, plus the sort of multi data center and everything else. I think it's really, really good. Borg works, it's been working for a long time. >> David: Alright, and I want to thank you for that little Pepper pen that I got over at your booth, as the coolest-- >> Come and get more. >> Gadget here. >> We also have Pepper sauce. >> Oh, of course. (laughter) Well there sir-- >> It's our sauce. >> There's the hot news from-- >> Ash: There you go. >> Pepperdata Ash Munshi. Thank you so much for being on the show, we appreciate it. >> Ash: My pleasure, thank you very much. >> And thank you for watching theCUBE. We're going to be back with more guests, including Ali Ghodsi, CEO of Databricks, coming up next. (upbeat music) (ocean roaring)
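For readers who want to poke at the time-series idea Munshi describes without any special tooling, Spark's own monitoring REST API exposes per-stage timing and resource counters that can be laid out on a timeline. The sketch below is not Pepperdata's Code Analyzer, just a rough approximation of the idea; the endpoint paths follow the standard Spark REST API, but field availability varies by Spark version, so treat the field names as assumptions to verify against your deployment.

```python
# Rough sketch: pull stage-level metrics from a running Spark driver's
# monitoring REST API and surface the busiest stages with their time windows,
# so overlapping (parallel) stages can be compared the way the interview describes.
import requests

DRIVER_UI = "http://localhost:4040"  # assumed driver UI address

app_id = requests.get(f"{DRIVER_UI}/api/v1/applications").json()[0]["id"]
stages = requests.get(f"{DRIVER_UI}/api/v1/applications/{app_id}/stages").json()

completed = [s for s in stages if s.get("status") == "COMPLETE"]
# Sort by total executor run time to find the stages that dominate the CPUs.
completed.sort(key=lambda s: s.get("executorRunTime", 0), reverse=True)

for s in completed[:5]:
    print(f"stage {s.get('stageId'):>4}  "
          f"{s.get('submissionTime')} -> {s.get('completionTime')}  "
          f"executorRunTime(ms)={s.get('executorRunTime', 0)}")
```

Plotting those submission-to-completion windows against each other, for example in a notebook, is what turns "stage three looks guilty" into "stage two is actually the one pegging the CPU for longer."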
Day 2 Kickoff - #SparkSummit - #theCUBE
[Narrator] Live from San Francisco it's the Cube covering Spark Summit 2017 brought to you by Databricks. >> Welcome to the Cube. My name is David Goad and I'm your host and we are here at Spark day two. It's the Spark Summit and I am flanked by a couple of consultants here from-- sorry, analysts from Wikibon. I got to get this straight. To my left we have Jim Kobielus who is our lead analyst for Data Science. Jim, welcome to the show. >> Thanks David. >> And we also have George Gilbert who is the lead analyst for Big Data and Analytics. I'll get this right eventually. So why don't we start with Jim. Jim just kicking off the show here today, we wanted to get some preliminary thoughts before we really jump into the rest of the day. What are the big themes that we're going to hear about? >> Yeah, today is the Enterprise day at Spark Summit. So Spark for the Enterprise. Yesterday was focused on Spark, the evolution, extension of Spark to support native development of deep learning as well as speeding up Spark to support sub-millisecond latencies. But today it's all about Spark and the Enterprise, really what I call wrapping dev-ops around Spark, making it more productionizable, supportable. The Databricks serverless announcement, though it was announced yesterday, the press release went up, they're going into some depth right now in the keynote about serverless, and really serverless is all about providing an in-cloud Spark, essentially a sandbox for teams of developers to scale up and scale out enough resources to do the modeling, the training, the deployment, the iteration, the evaluation of Spark jobs in essentially a 24 by seven, multi-tenant, fully supported environment. So it's really about driving this continuous Spark development and iteration process into a 24 by seven model in the Enterprise, which is really what's happening: data scientists and Spark developers are becoming an operational function. Businesses are building strategic infrastructure around things like recommendation engines, and e-commerce environments absolutely demand 24 by seven, resilient, Spark team based collaboration environments, which is really what the serverless announcement is all about. >> David: So getting increasing demand on mission critical problems so that optimization is a big deal. >> Yeah, data science is not just an R&D function, it's an operational IT function as well. So that's what it's all about. >> David: Awesome, well let's go to George. I saw you watching the keynote. I think still watching it again this morning, so taking notes feverishly. What were some of the things that stuck out to you from the keynote speaker this morning? >> There are some things that are sort of going to bleed over from yesterday where we can explore some more. We're going to have on the show the chief architect, Reynold Xin, and the CEO, Ali Ghodsi, and some of the things that we want to understand is how the scope of applications that are appropriate for Spark are expanding. We got sort of unofficial guidance yesterday that, you know, just because Spark doesn't handle key value stores or databases all that tightly right now, that doesn't mean it won't in the future, on the Apache Spark side through better APIs and on the Databricks side, perhaps custom integration, and the significance of that is that you can open up a whole class of operational apps, apps that run your business and that now incorporate, you know, rich analytics as well.
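To make the "operational apps with rich analytics" idea a bit more tangible, here is a minimal PySpark Structured Streaming sketch of that pattern: a continuously updated aggregate that an application can query while the stream keeps running. This is only an illustration, not anything announced by Databricks; the built-in rate source and in-memory sink stand in for a real event stream and a real serving store such as a key-value database.

```python
# Minimal "continuous app" sketch: a streaming aggregate that stays queryable
# while it keeps updating. The rate source and memory sink are placeholders
# for a real event stream and a real key-value / serving store.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("continuous-app-sketch").getOrCreate()

events = (spark.readStream
          .format("rate")                  # synthetic stream of (timestamp, value)
          .option("rowsPerSecond", 100)
          .load())

counts = (events
          .withColumn("key", F.col("value") % 10)  # pretend this is a product id
          .groupBy("key")
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("memory")                 # exposed as an in-memory SQL table
         .queryName("live_counts")
         .start())

time.sleep(10)  # let a few micro-batches land
spark.sql("SELECT * FROM live_counts ORDER BY key").show()  # an app could poll this
query.stop()
```

In a production version, the sink would be something an operational service can read with low latency, which is exactly the database and key-value integration question being raised here.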
Another thing that we'll want to be asking about is, keying off what Jim was saying, now that this becomes not a managed service where you just take the labor that the end customer was applying to get the thing running, but it's now automated and you don't even know the infrastructure, we'll want to know what does that mean for the edge, you know, where we're doing analytics close to internet of things and people, and sort of whether there has to be a new configuration of Spark to work with that. And then of course what do we do about the whole data science process and the dev-ops for data science when you have machine learning distributed across the cloud and edge and on-prem. >> Jim: In fact, I know we have Pepperdata coming on right after this, who might be able to talk about that exact dev-ops in terms of performance optimization in a distributed Spark environment, yeah. >> George, I want to follow up with that. We had Matt Fryer from Hotels.com, he's going to be on our show later but he was on the keynote stage this morning. He talked about going all cloud, all Spark, and how data science is even a competitive advantage for Hotels.com. What do you want to dig into when we get him on the show? >> That's a really good question because if you look at business strategy, you don't really build a sustainable advantage just by doing one thing better than everyone else. That's easier to pick off. The sustainable strategic advantages come from not just doing one thing better than everyone else but many things, and then orchestrating their improvement over time, and I'd like to dig into how they're going to do that. 'Cause remember, Hotels.com is the internet-equivalent descendant of the original travel reservation systems, which did confer competitive advantage on the early architects and deployers of that technology. >> Great, and then Pepperdata, we wanted to come back and we're going to have them on the show here in just a moment. What would you like to learn from them? What do you think will benefit the community the most? >> Jim: Actually, keying off something George said, I'd like to get a sense for how you optimize Spark deployments in a radically distributed IOT edge environment. Whether they've got any plans, or what their thoughts are in terms of the challenges there. As more of the intelligence gets pushed to the edge, much of that will be on machine learning and deep learning models built into Spark. What are the challenges there? I mean, if you've got thousands to millions of end points that are all autonomous and intelligent and they're all running Spark, just what are the orchestration requirements, what are the resource management requirements, how do you monitor end-to-end in an environment like that and optimize the passing of data and the transfer of the control flow or orchestration across all those dispersed points. >> Okay, so 30 seconds now, why should the audience tune into our show today? What are they going to get? >> I think what they're going to get is a really good sense for the emerging best practices for optimizing Spark in a distributed fog environment out to the edge, where not just the edge devices but everything, all nodes, will incorporate machine learning and deep learning. They'll get a sense for what's been done today, what the tooling is to enable dev-ops in that kind of environment. As well as, sort of, the emerging best practices for compressing more of these algorithms and the data itself, as well as doing training in a theoretically federated environment.
I'm hoping to hear from some of the vendors who are on the show today. >> David: Fantastic, and George, closing thoughts on the opening segment? 30 seconds. >> Closing thoughts on the opening segment. Like Jim, we want to think about Spark holistically, and it has traditionally been best positioned as sort of this-- as Matei acknowledged yesterday, sort of this offline branch of analytics that you apply to data in a sort of repository that you've accumulated, and now we want to see it put into production, but to do that you need more than just what Spark is today. You need basically a database or key value kind of option so that you're storing your work as it goes along, so you can go back and analyze it, either simple analysis or complex analysis. So I want to hear about that. I want to hear about their plans for IOT. Spark is kind of a heavyweight environment, so you're probably not going to put it in the boot of your car, or at least not likely anytime soon. >> Jim: Intelligent edge. I mean, Microsoft Build a few weeks ago was really deep on intelligent edge. HP, who we're doing their show actually, I think it's in Vegas, right? They're also big on intelligent edge. In fact, we had somebody on the show yesterday from HP going into some depth on that. I want to hear what Databricks has to say on that theme. >> Yeah, and which part of the edge, is it the gateway, the edge gateway, which is really a slimmed-down server, or the edge device, which could be a 32 bit, meg-of-RAM network card. >> Yeah. >> All right, well gentlemen, appreciate the little insight here before we get started today, and we're just getting started. Thank you both for being on the show, and thank you for watching the Cube. We'll be back in a little while with our CEO from Databricks. Thanks for watching. (upbeat music)
Day One Wrap - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's the CUBE covering Spark Summit 2017, brought to you by Databricks. (energetic music plays) >> And what an exciting day we've had here at the CUBE. We've been at Spark Summit 2017, talking to partners, to customers, to founders, technologists, data scientists. It's been a load of information, right? >> Yeah, an overload of information. >> Well, George, you've been here in the studio with me talking with a lot of the guests. I'm going to ask you to maybe recap some of the top things you've heard today for our guests. >> Okay so, well, Databricks laid down, sort of, three themes that they wanted folks to take away. Deep learning, Structured Streaming, and serverless. Now, deep learning is not entirely new to Spark. But they've dramatically improved their support for it. I think, going beyond the frameworks that were written specifically for Spark, like Deeplearning4j and BigDL by Intel. And now, like, TensorFlow, which is the opensource framework from Google, has gotten much better support. Structured Streaming, it was not clear how much more news we were going to get, because it's been talked about for 18 months. And they really, really surprised a lot of people, including me, where they took, essentially, the processing time for an event or a small batch of events down to 1 millisecond. Whereas, before, it was in the hundreds if not higher. And that changes the type of apps you can build. And also, the Databricks guys had coined the term continuous apps, which means they operate on a never-ending stream of data, which is different from what we've had in the past, where it's batch or, with a user interface, request-response. So they definitely turned up the volume on what they can do with continuous apps. And serverless, they'll talk about more tomorrow. And Jim, I think, is going to weigh in. But it, basically, greatly simplifies the ability to run this infrastructure, because you don't think of it as a cluster of resources. You just know that it's sort of out there, and you ask requests of it, and it figures out how to fulfill it. I will say, the other big surprise for me was when we had Matei, who's the creator of Spark and the chief technologist at Databricks, come on the show, when we asked him about how Spark was going to deal with, essentially, more advanced storage of data so that you could update things, so that you could get queries back, so that you could do analytics, and not just of stuff that's stored in Spark but stuff that Spark stores essentially below it. And he said, "You know, you can expect to see Databricks come out with or partner with a database to do these advanced scenarios." And I got the distinct impression, and after listening to the tape again, that he was talking about, for Apache Spark, which is separate from Databricks, that they would do some sort of key-value store. So in other words, when you look at competitors or quasi-competitors like Confluent with Kafka or data Artisans with Flink, they don't, they're not perfect competitors. They overlap some. Now Spark is pushing its way more into overlapping with some of those solutions. >> Alright. Well, Jim Kobielus. And thank you for that, George. You've been mingling with the masses today. (laughs) And you've been here all day as well. >> Educated masses, yeah, (David laughs) who are really engaged in this stuff, yes. >> Well, great, maybe give us some of your top takeaways after all the conversations you've had today. >> They're not all that dissimilar from George's.
Databricks, of course being the center, the developer, the primary committer in the Spark opensource community, has done a number of very important things in terms of the announcements today at this event that push Spark, the Spark ecosystem, where it needs to go to expand the range of capabilities and their deployability into production environments. I feel the deep-learning side announcement, in terms of the deep-learning pipeline API, is very, very important. Now, as George indicated, Spark has been used in a fair number of deep-learning development environments. But not as a modeling tool so much as a training tool, a tool for in-memory distributed training of deep-learning models that were developed in TensorFlow, in Caffe, and other frameworks. Now this announcement is essentially bringing support for deep learning directly into the Spark modeling pipeline, the machine-learning modeling pipeline, being able to call out to deep learning, you know, TensorFlow and so forth, from within MLlib. That's very important. That means that Spark developers, of which there are many, far more than there are TensorFlow developers, will now have an easy path to bring more deep learning into their projects. That's critically important to democratize deep learning. I hope, and from what I've seen what Databricks has indicated, that they currently have API support reaching out to both TensorFlow and Keras, and that they have plans to bring in API support for access to other leading DL toolkits such as Caffe, Caffe 2, which is Facebook-developed, such as MXNet, which is Amazon-developed, and so forth. That's very encouraging. Structured Streaming is very important in terms of what they announced, which is an API to enable access to faster, or higher-throughput, Structured Streaming in their cloud environment. And they also announced that they have gone beyond, in terms of the code that they've built, the micro-batch architecture of Structured Streaming, to enable it to evolve into a more true streaming environment, to be able to contend credibly with the likes of Flink. 'Cause I think that the Spark community has, sort of, had their back against the wall with Structured Streaming, that they couldn't fully provide a true sub-millisecond end-to-end latency environment heretofore. But it sounds like with this R&D that Databricks is addressing that, and that's critically important for the Spark community to continue to evolve in terms of continuous computation. And then the serverless-apps announcement is also very important, 'cause I see it as really being, it's a fully-managed, multi-tenant Spark-development environment, as an enabler for continuous Build, Deploy, and Testing DevOps within a Spark machine-learning and now deep-learning context. The Spark community as it evolves and matures needs robust DevOps tools to production-ize these machine-learning and deep-learning models. Because really, in many ways, many customers, many developers are now using, or developing, Spark applications that are real 24-by-7 enterprise application artifacts that need a robust DevOps environment. And I think that Databricks has indicated they know where this market needs to go and they're pushing it with R&D. And I'm encouraged by all those signs. >> So, great. Well thank you, Jim. I hope both you gentlemen are looking forward to tomorrow. I certainly am. >> Oh yeah. >> And to you out there, tune in again around 10:00 a.m. Pacific Time. We're going to be broadcasting live here.
From Spark Summit 2017, I'm David Goad with Jim and George, saying goodbye for now. And we'll see you in the morning. (sparse percussion music playing) (wind humming and waves crashing).
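As a concrete picture of the deep-learning pipelines idea both analysts highlight above, the pattern the announced API implies is a pre-trained network used as a featurizer stage inside an ordinary Spark ML pipeline. The sketch below assumes the open-source sparkdl package Databricks introduced around this summit; the class names, the readImages helper, the filePath column, and the directory layout follow its early examples but should be verified against the library's own documentation before use.

```python
# Hedged sketch of Spark deep learning pipelines: a pre-trained CNN used as a
# featurizer stage, with a simple classifier trained on top, all through the
# standard Spark ML Pipeline API. Paths and label logic are invented.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer, readImages

spark = SparkSession.builder.appName("dl-pipeline-sketch").getOrCreate()

# Hypothetical layout: /data/pet_images/cats/... and /data/pet_images/dogs/...
images = readImages("/data/pet_images/")
labeled = images.withColumn(
    "label", F.when(F.col("filePath").contains("/cats/"), 1.0).otherwise(0.0))

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")  # pre-trained network
lr = LogisticRegression(maxIter=10, labelCol="label", featuresCol="features")

model = Pipeline(stages=[featurizer, lr]).fit(labeled)
model.transform(labeled).select("filePath", "prediction").show(5)
```

The appeal is exactly the democratization point made above: a Spark developer gets transfer learning through the Pipeline and Estimator abstractions they already know, without writing TensorFlow code directly.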
Shafaq Abdullah, The Honest Company - #SparkSummit - #theCUBE
>> Announcer: Covering Spark Summit 2017, brought to you by Databricks. >> This is theCUBE, and we're having a great time at Spark Summit 2017. One of our last guests of the day is Shafaq Abdullah, who is the director of data infrastructure at the Honest Company. Shafaq, welcome to the show. >> Thank you. >> Now, I heard about The Honest Company because of the celebrity founder, right, Jessica Alba? >> Shafaq: That's correct. >> Okay, but how did you end up at the company, weren't you at a start-up before? >> That's exactly correct. So, basically, we did a start-up called InSnap before we actually got into Honest, and the way it happened is that InSnap was more about instantaneously building personas, using machine learning and a big data stack, and Honest at that time was trying to find someone who could help them with their data challenges. So, InSnap was the right piece, in some of its technology and expertise in big data and machine learning, so we basically built real-time, instantaneous personas to increase engagement and monetization. It was backed up by big data, machine learning and Spark inside our technology. So we used that to basically help Honest to really become data driven, to solve their next generation problem of making products which drive value out of data, and understand their customers better, operate better business, optimize business better. That is why they acquired us, and essentially, we deal with the technology in their stack, not only the technology, but also the culture, the business processes, and the teams which operate those. >> Okay, we're going to dive into some of the technical details about what you're developing with George in just a second, but I have to ask, the company culture is really important at The Honest Company, right? They're well known for being eco-friendly and socially responsible. What was it like moving from a start-up into that company environment, or was it just a natural? >> Basically, of course, Honest was a much bigger start-up, four or five years after it was initially created, so we at InSnap were very lean, agile and much more data driven. That was a bigger difference. So the way we solved it was, they actually allowed us to create our data organization, called Data Science, which was heading all the data initiatives. And then, we worked with other cross-functional teams, with finance, with accounting, with growth, with sales, to basically help them understand what their needs are, and how to become really data driven by driving the value out of the data by using state of the art technology. So it was a mix of team alignment and cultural change, focused on the business goal, and getting everyone to gather around it to make the change. I really enjoyed that while we actually carried out this journey of Honest from being just descriptive, which is essentially just finding what has happened in the data, just generating reports for revenue, to becoming more predictive and prescriptive, which is more like advanced analytics and also an advanced advisory role, which together play into making decisions around features, around the business and the operations. >> And George, you talked to a lot of customers today, and some of the same themes. Do you want to drill down on some of the details of what they're doing. >> I'm curious about how you chose the first projects to get quick wins and to establish credibility? >> Yeah, that's actually a very good question.
Basically, we were focused around the low-hanging fruit in order to give us a jump-start and to build our reputation, so that we could actually take on much more advanced technology projects. And in order to do that, what we did was, if you go to Honest.com, and you search in their search bar, their search was very flimsy, and it was not revealing good results. We had already built our engine, like a matching engine, so it was very easy to extend it into a full search engine. That was the first deliverable which we could deliver, and we delivered it in under a month and a half or two months, right when we came in. And it was like, hey, these guys just improved our search by 10x or 100x; we are getting much more hits, much more coverage of the search terms. And that set the tone. Then, another piece which we wanted to tackle was, how do we improve Honest recommendations. That was another project. But before doing that, Honest did not even have a data warehouse, so that you can get all the data in one place, like a data lake, because the data was siloed in organizations, and the analysts could not really get the data into one place and mix and match and analyze the data. So that was another big piece which we did, but we did it very early on. That was the second big deliverable, even before recommendations, the data warehouse. So basically, we plugged Spark in right in the middle, sucked up all the data from different places, shoved the data in, made this ETL engine, which basically extracted, transformed and loaded the data into the data warehouse. Now, this data warehouse basically broke away those silos and made them into a cohesive data lake which could be used for driving value and understanding patterns, especially for machine learning, analysts and all the decision makers. >> Was it a data warehouse, or was it a data lake? The reason I ask for the distinction is, a data warehouse is usually extremely well curated for navigation and discoverability, whereas the data lake is, as some people say, a little step up from a swamp. >> That's right, so basically, when I call it a data lake, it's because we have two data aggregation or data gathering infrastructures. One is backed by Spark and S3, which we call a data lake, where there's unstructured, structured data, all kinds of data there, mix and match, and it's not that easy sometimes, you need to do some transformation on top of the data which is sitting there in order to really get to the needle in the haystack. But the data warehouse is in Redshift, which basically gets the data from the data lake, or the Spark ETL engine, and then makes it more like a metric-driven report, so that it's easily discoverable and it is more like what the business requires right now. It's more like formal reports, and the dimensions and all those attributes are much more well thought out. Whereas the data lake is kind of like throwing it all in one place so that at least we have the data in one place, and then we can analyze and process it. >> In putting all the data first in the data lake and then, essentially, refining it into the data warehouse, what did you use to keep track of the lineage and to make sure that you knew the truth, or truthfulness, behind all the data in the data warehouse once it got there? >> So basically, we built a data model on top of S3 and Spark.
We used that data model as a basis, as a source of truth, to feed the reports, and that data model was consistent wherever you find it. So we want to make sure that those attributes, those dimensions and anything related to that data model, for the e-commerce as well as the offline portion, are consistent. And so we use Spark, we use S3, essentially, to get that data model consistent, and also, we use a bunch of advanced monitoring stuff for that. When we are processing jobs, we want to make sure that we don't lose the data, and we remove the coupling between the systems by decoupling them, and essentially, in the next version, we made it event streams, event-based, so that was the general strategy which we adopted in order to make sure that we have consistency around the data lake and data warehouse. >> What would be the next step? So, now you've significantly enhanced business intelligence, and you have the richest repository behind that data warehouse. What would you do either with the data in the data warehouse or the data in the data lake repository? >> So we are constantly enriching our data lake, because that needs to be updated all the time, but at the same time, we want to connect business with our metrics; they essentially derive from all of that data which is sitting in the data lake to help optimize a problem. For example, we are working on sales optimization. We are working on operations optimization, demand planning, supply planning, in addition to customer insights. We are also working on other strategic projects. For example, instead of just recommending or predicting LTV or churn, we are trying to be more prescriptive in our analytics, in which it takes an advisory role, and looks over all the marketing spend, not just predicting the high-LTV customers, but actually allocating budget for different marketing spend across different channels, for omnichannel. For example, TV, display ads, you know, all of that, so that's also happening as we speak, as we enrich our data lake and essentially generate those reports. Now, then we also need to circle back with the business folks or decision makers in order to really convince them to use that. So that's why we created these cross-functional teams, aligned to a business goal, contextually aware teams, which know their roles and responsibilities, but at the same time, which can collaborate effectively and produce a result which drives the bottom line. >> What kind of customer insights were you looking for? Do they deliver family products, diapers to the home and that sort of thing? What sort of customer insights were you looking for and how is it working? >> Basically, at Honest, for all our target customers, we need to better understand what their needs are. So customer insights, for example, the demographics of the customers. In addition, we also wanted to see what are the things, what are the patterns which are common across customers, so that we can recommend products which are being bought by one segment of customers versus another. Those common properties, it could be mothers who have recently had children, who live in this neighborhood and have this kind of income level. So how do we ensure that we actually predict their demands before they actually happen.
So we need to understand their habits, we need to understand the context behind it, if they are making some search, how many pages they view for this kind of product or that kind of product, and similarly other things which enhance the understanding of the customers, make them into different buckets of segments, and then use those segments to target, because we already have data about LTV and churn, as predictive models revealing if a customer is going to churn for whatever reason. We know, by doing a similar campaign for other customers, this has successfully given us more subscriptions or helped us to reduce churn; that is how we target them and optimize our campaigns or our promotions for that. >> David: Sure. >> We're also looking at the overall lifestyle of the people who are passionate about Honest brands or brands that exhibit similar values, for example, eco-friendly, safe, and trusted products. >> Right, so we have just a couple of minutes to go before we get to the break. This is great stuff and George, I'll come back to you for a final question in just a moment, but in 30 seconds or so, tell us why you selected Databricks. You probably looked at other options, right? >> Shafaq: Absolutely. >> Can you give us a quick, why you made the decision? >> Absolutely, when we came in at Honest, all they had was a bunch of MySQL developers, and very limited big data knowledge. So, now they needed a jump start in order to really get to that level in a very small time. How is that even achievable? We didn't even have dedicated data-ops on our team. So basically, Databricks helped to bridge that gap by allowing us to get the infrastructure efficiency we needed by spinning up in a hassle-free manner. They also had this notebooks feature where we can scale the code and scale the team by actually reusing the boilerplate code, and similarly, different teams have different expertise. For example, data science teams like Python and data engineers like Scala. So now those Scala people write functions which can be called by teams in data science in the same notebook, essentially giving them the ability to collaborate effectively. And then we also needed some tool to give more traction and visualization for data scientists as well as data engineers. Databricks has big visualization built in, which helps to understand the causation, correlation, at least correlation, right off the bat, without even importing the data into R or some other external tool and making those charts. So there are a bunch of advantages around what we wanted. And then it has a platform API, like DBFS, which is a distributed file system, similar to S3, which are cool APIs which again provided us the jump start which we needed; in so little time, we actually made not only the data warehouse, but also data-driven products. >> It sounds like Databricks has delivered. >> Shafaq: Oh yeah. >> Awesome. All right, George, just enough time for one more question if you want to throw one in. >> This one is kind of technical, but not on the technology side so much as, how do you guys measure attribution between channels in omni-channel marketing? >> That's a very good question. We have this project called Marketing Attribution, and essentially, the scope of that project is, we want to give the right weights to the right clicks of the customer along the journey to subscription or conversion.
So, we have a model which basically uses a bunch of techniques, including weighted and linear regression, to come up with some kind of a weighted way of allowing those weights to be distributed among different channels. And then, the first problem to solve is that we needed to instrument logging so that we get those clicks and searches, all of that, into our data lake. That was done beforehand, before starting the MTA project, because we have a bunch of touch points. A customer could be doing a search, he could be calling our sales rep, he could be tracking his order online, or he could be just leaving his cart in a state which is not fulfilled. And then, now we are trying to get it offline also, on top of that, and we are working to get it so that we know what a customer is doing in store, and we have a seamless experience using this MTA, as a next version of it, to give them a seamless experience in a brick and mortar store or online. >> Great, that's great stuff, Shafaq. I wish we had more time to go. We'll talk to you more after we stop rolling. Thank you for being so honest, and we appreciate you being on the show. >> Thank you, I really appreciate it. >> Thank you so much. >> George: Shafaq, that was great. >> All right, to all of you, thank you so much. We're going to be back in a few moments with the daily wrap up. You don't want to miss that. Thank you for joining us on theCUBE for Spark Summit 2017.
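As a rough illustration of the weighted-regression attribution Shafaq describes, here is a hedged PySpark sketch; the channel list, column names, and table path are invented for the example and are not Honest's actual MTA implementation.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("attribution-sketch").getOrCreate()

# One row per customer journey: touch counts per channel plus a conversion flag.
journeys = spark.read.parquet("s3a://example-bucket/journeys/")

channels = ["search", "display", "tv", "email", "sales_call"]
vec = VectorAssembler(inputCols=channels, outputCol="features")

# The fitted coefficients act as per-channel weights for how strongly each
# touchpoint is associated with conversion -- a simple stand-in for the
# weighted attribution model described in the interview.
model = LogisticRegression(featuresCol="features", labelCol="converted") \
    .fit(vec.transform(journeys))

for channel, weight in zip(channels, model.coefficients):
    print(channel, round(float(weight), 4))
```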
Jags Ramnarayan, SnappyData - Spark Summit 2017 - #SparkSummit - #theCUBE
(techno music) >> Narrator: Live from San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks. >> You are watching the Spark Summit 2017 coverage by theCUBE. I'm your host David Goad, and joined with George Gilbert. How you doing George? >> Good to be here. >> And honored to introduce our next guest, the CTO from SnappyData, wow we were lucky to get this guy. >> Thanks for having me >> David: Jags Ramnarayan, Jags thanks for joining us. >> Thanks, thanks for having me. >> And for people who may not be familiar, maybe tell us what does SnappyData do? >> So SnappyData in a nutshell, is taking Spark, which is a compute engine, and in some sense augmenting the guts of Spark so that Spark truly becomes a hybrid database. A single data store that's capable of taking Spark streams, doing transactions, providing mutable state management in Spark, but most importantly being able to turn around and run analytical queries on that state that is continuously emerging. That's in a nutshell. Let me just say a few things, SnappyData itself is a startup that is a spin-out, a spin-out of Pivotal. We've been out of Pivotal for roughly about a year, so the technology itself was to a great degree incubated within Pivotal. It's a product called GemFire within VMware and Pivotal. So we took the guts of GemFire, which is an in-memory database designed for transactional, low-latency, high-confidence scenarios, and we are sort of fusing it, that's the key thing, fusing it into Spark, so that now Spark becomes significantly richer, not just as a compute platform, but as a store. >> Great, and we know this is not your first Spark Summit, right? How many have you been to? Lost count? >> Boy, let's see, three, four now, Spark Summits, if I include the Spark Summit this year, four to five. >> Great, so an active part of the community. What were you expecting to learn this year, and have you been surprised by anything? >> You know, it's always wonderful to see, I mean, every time I come to Spark, it's just a new set of innovations, right? I mean, when I first came to Spark, it was a mix of, let's talk about data frames, all of these, let's optimize my queries. Today you come, I mean there is such a wide spectrum of amazing new things that are happening. It's just mind boggling. Right from AI techniques, structured streaming, and the real-time paradigm, and sort of this confluence that Databricks brings more to it. How they can create a confluence through a unified mechanism is where it is really brilliant, is what I think. >> Okay, well let's talk about how you're innovating at SnappyData. What are some of the applications or current projects you're working on? So a number of things, I mean, GE is an investor in SnappyData. So we're trying to work with GE in the industrial IoT space. We're working with large health care companies, also in their IoT space. So the pattern with SnappyData is one that has a lot of high velocity streams of data emerging, where the streams could be, for instance, Kafka streams driving Spark streams, but streams could also be operational databases. Your Postgres instance and your Cassandra database instance, and they're all generating continuous changes to data that's emerging in an operational world; can I suck that in and almost create a replica of that state that might be emerging in the source operational environment, and still allow interactive analytics at scale for a number of concurrent users on live data.
Not cube data, not pre-aggregated data, but on live data itself, right? Being able to almost give you Google-like speeds to live data. >> George, we've heard people talking about this quite a bit. >> Yeah, so Jags, as you said upfront, Spark was conceived as sort of a general purpose, I guess, analytic compute engine, and adding DBMS to it, like sort of not bolting it on, but deeply integrating it, so that the core data structures now have DBMS properties, like transactionality, that must make a huge change in the scope of applications that are applicable. Can you describe some of those for us? >> Yeah. The classic paradigm today that we find time and again is the so-called SMACK stack, right? I mean, lambda stack, now there's a SMACK stack. Which is really about Spark running on Mesos, but really using Spark streaming as an ingestion capability, and there is continuous state that is emerging that I want to write into Cassandra. So what we find very quickly is that the moment the state is emerging, I want to throw in a business intelligence tool on top and immediately do live dashboarding on that state that is continuously changing and emerging. So what we find is that the first part, which is the high speed ingest, the ability to transform these data sets, cleanse the data sets, get the cleansed data into Cassandra, works really well. What is missing is this ability to say, well, how am I going to get insight? How can I ask interesting, insightful questions and get responses immediately on that live data, right? And so the common problem there is the moment I have Cassandra working, let's say, with Spark, every time I run an analytical query, you only have two choices. One is use the parallel connector to pull in the data sets from Cassandra, right, and now unfortunately, when you do analytics, you're working with large volumes. And every time I run even a simple query, all of a sudden I could be pulling in 10 gigabytes, 20 gigabytes of data into Spark to run the computation. Hundreds of seconds lost. Nothing like interactive, it's all about batch querying. So how can I turn around and say that if stuff changes in Cassandra, I can have an immediate real-time reflection of that mutable state in Spark on which I can run queries rapidly. That's a very key aspect to us. >> So you were telling me earlier that you didn't see, necessarily, a need to replace entirely, the Cassandra in the SMACK stack, but to complement it. >> Jags: That's right. >> Elaborate on that. >> So our focus, much like Spark, is all about in-memory state management and in-memory processing. And Cassandra, realistically, is really designed to say how can I scale to the petabyte, right, for key value operations, semi-structured data, what have you. So we think there are a number of scenarios where you still want Cassandra to be your store, because in some sense a lot of these guys have already adopted Cassandra in a fairly big way. So you want to say, hey, leave your petabyte-level volume in there, and you can essentially work with the real-time state, which could still be many terabytes of state, essentially in main memory, and that's what we are specializing in.
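For reference, the baseline pattern Jags is contrasting with, pulling data out of Cassandra into Spark through a parallel connector for each analytical query, looks roughly like the sketch below. The keyspace, table, and connector version are placeholders, not a specific deployment.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-pull")
         # The DataStax connector must be on the classpath; version is illustrative.
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.11:2.0.2")
         .getOrCreate())

# Each analytical query over this DataFrame re-reads the required partitions
# from Cassandra across the cluster -- the "10 to 20 gigabytes per query"
# cost described above, unless the hot state is kept resident in Spark memory.
events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="ops", table="events")   # placeholder names
          .load())

events.groupBy("event_type").count().show()
```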
And I can also touch on this approximate query processing technology, which is the other key part here, to say hey, I can't really throw 1,000 cores and 1,000 machines at it just so that you can do your job really well, so one of the techniques we are adopting, which even the Databricks guys started with BlinkDB, essentially an approximate query processing engine; we have our own approximate query processing engine as an adjunct, essentially, to our store. What that essentially means is to say, can I take a billion records and synthesize something really, really small, using smart sampling techniques, sketching techniques, essentially statistical structures, that can be stored along with Spark, in Spark memory itself, and fuse it with the Spark Catalyst query engine. So that as you run your query, we can very smartly figure out, can I use the approximate data structures to answer the questions extremely quickly. Even when the data would be in petabyte volume, I have these data structures that are just taking maybe gigabytes of storage only. >> So hopefully not getting too, too technical, so the Spark Catalyst query optimizer, like an Oracle query optimizer, it knows about the data that it's going to query, only in your case, you're taking what Catalyst knows about Spark, and extending it with what's stored in your native, also Spark native, data structures. >> That's right, exactly. So think about it: an optimizer always takes a query plan and says, here are all the possible plans you can execute, and here are the cost estimates for these plans; we essentially inject more plans into that and hopefully our plan is even more optimized than the plans that the Spark Catalyst engine came up with. And Spark is beautiful because the Catalyst engine is a very pluggable engine. So you can essentially augment that engine very easily. >> So you've been out in the marketplace, whether in alpha, beta, or now, production, for enough time so that the community is aware of what you've done. What are some of the areas that you're being pulled into that people didn't associate Spark with? >> So more often, we land up in situations where they're looking at SAP HANA, as an example, maybe a MemSQL, maybe just Postgres, and all of a sudden, there are these hybrid workloads, which is the Gartner term of HTAP, so there's a lot of HTAP use cases where we get pulled into. So there's no Spark, but we get pulled into it because we're just a hybrid database. That's what people look at us as, essentially. >> Oh, so you pull Spark in because that's just part of your solution. >> Exactly, right. So think about it: Spark is not just data frames and rich APIs, but it also has a SQL interface, right. I can essentially execute SQL, select SQL. Of course we augment that SQL so that now you can do what you expect from a database, which is an insert, an update, a delete, can I create a view, can I run a transaction? So all of a sudden, it's not just a Spark API but what we provide looks like a SQL database itself. >> Okay, interesting.
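The approximate-query idea can be previewed with nothing more than Spark's built-in sampling. Real AQP engines, BlinkDB-style or SnappyData's own, use stratified samples, sketches, and error bounds fused into the optimizer, so treat this only as the simplest flavor of the speed-versus-accuracy trade-off; the path and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aqp-sketch").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/events/")  # placeholder path

# Exact aggregate: scans everything.
exact = events.groupBy("country").agg(F.count("*").alias("cnt"))

# Approximate aggregate: answer on a 1% uniform sample, scaled back up.
fraction = 0.01
approx = (events.sample(withReplacement=False, fraction=fraction, seed=42)
          .groupBy("country")
          .agg((F.count("*") / fraction).alias("approx_cnt")))

approx.show()
```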
So tell us, in the work with GE, they're among the first that have sort of educated the world that in that world there's so much data coming off devices that we have to be intelligent about what we filter and send to the cloud; we train models, potentially, up there, we run them closer to the edge, so that we get low latency analytics, but you were telling us earlier that there are alternatives, especially when you have such an intelligent database working both at the edge and in the cloud. >> Right, so that's a great point. See, what's happening with sort of a lot of these machine learning models is that these models are learned on historical data sets. And quite often, especially if you look at predictive maintenance, those classes of use cases in industrial IoT, the patterns can evolve very rapidly, right? Maybe because of climate changes and, let's say, for a windmill farm, there are a few windmills that are breaking down so rapidly it's affecting everything else in terms of the power generation. So being able to sort of update the model itself, incrementally and in near real-time, is becoming more and more important. >> David: Wow. >> It's still a fairly academic research kind of area, but for instance, we are working very closely with the University of Michigan to sort of say, can we use some of these approximate techniques to incrementally also learn a model. Right, sort of incrementally augment a model, potentially at the edge, or even inside the cloud, for instance. >> David: Wow. >> So if you're doing it at the edge, would you be updating the instance of the model associated with that locale and then would the model in the cloud be sort of like the master, and then that gets pushed down, until you have an instance and a master. >> That's right. See, most typically what will happen is you have computed a model using a lot of historical data. You have typically supervised techniques to compute a model. And you take that model and inject it potentially into the edge, so that it can execute that model, which is the easy part, everybody does that. So you continue to do that, right, because you really want the data scientists to be poring through those paradigms, looking and sort of tweaking those models. But for a certain number of models, even the models injected at the edge, can I re-tweak that model in an unsupervised way, is kind of the play we're also kind of venturing into slowly, but that's all in the future. >> But if you're doing it unsupervised, do you need metrics that sort of flag, like what is the champion challenger, and figure out-- >> I should say that, I mean, not all of these models can work in this very real-time manner. So, for instance, we've been looking at saying, can we rework the naive Bayes classifier to essentially do incremental classification, or incrementally learn the model. Clustering approaches can actually be done in an unsupervised way in an incremental fashion. Things like that. There's a whole spectrum of algorithms that really need to be thought through for approximate algorithms to actually apply. So it's still active research. >> Really great discussion, guys. We've just got about a minute to go, before the break, really great stuff. I don't want to interrupt you. But maybe switch real quick to business drivers. Maybe with SnappyData or with other peers you've talked to today. What business drivers do you think are going to affect the evolution of Spark the most?
I mean, for us, as a small company, the single biggest challenge we have, it's like what one of you guys said, analysts, it's raining databases out there. And the ability to constantly educate people on how they can essentially realize a very next-generation data pipeline in a very simplified manner is the challenge we are running into, right. I mean, I think the business model for us is primarily how many people are going to go and say, yes, batch-related analytics is important, but incrementally, for competitive reasons, we want to be playing that real-time analytics game a lot more than before, right? So that's going to be big for us, and hopefully we can play a big part there, along with Spark and Databricks. >> Great, well we appreciate you coming on the show today, and sharing some of the interesting work that you're doing. George, thank you so much, and Jags, thank you so much for being on theCUBE. >> Thanks for having me on, I appreciate it. Thanks, George. And thank you all for tuning in. Once again, we have more to come, today and tomorrow, here at Spark Summit 2017, thanks for watching. (techno music)
Matthew Hunt | Spark Summit 2017
>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCUBE, we're talking about data science and engineering at scale, and we're having a great time, aren't we, George? >> We are! >> Well, we have another guest now we're going to talk to, I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg, Matt, thanks for joining us! >> My pleasure. >> Alright, we're going to talk about a lot of exciting stuff here today, but I want to first start with, you're a long-time member of the Spark community, right? How many Spark Summits have you been to? >> Almost all of them, actually, it's quite amazing to see the 10th one, yes. >> And you're pretty actively involved with the user group on the east coast? >> Matt: Yeah, I run the New York users group. >> Alright, well, what's that all about? >> We have some 2,000 people in New York who are interested in finding out what goes on, and which technologies to use, and what are people working on. >> Alright, so hopefully, you saw the keynote this morning with Matei? >> Yes. >> Alright, any comments or reactions from the things that he talked about as priorities? >> Well, I've always loved the keynotes at the Spark Summits, because they announce something that you don't already know is coming in advance, at least for most people. The second Spark Summit actually had people gasping in the audience while they were demoing, a lot of senior people-- >> Well, the one millisecond today was kind of a wow one-- >> Exactly, and I would say that the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads. >> Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down some of the details. >> Sure. And Bloomberg is a large company with 4,000-plus developers, we've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news in the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware, but there's some significant things that some of these technologies can potentially really help simplify over time. Some recent ones, I guess, trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in, that's a natural application, another one would be regulatory, there's a regulatory system called MiFID, or MiFID II, the regulations required for Europe, you have to be able to record every trade for seven years, provide daily reports, there's clearly a lot around that, and then I would also just say, our other internal databases have significant analytics that can be done, which is just kind of scraping the surface. >> These applications sound like they're oriented towards streaming solutions, and really low latency. Has that been a constraint on what you can build so far?
>> I would definitely say that we have some things that are latency constrained, it tends to be not like high frequency trading, where you care about microseconds, but milliseconds are important, how long does it take to get an answer, but I would say equally important with latency is efficiency, and those two often wind up being coupled together, though not always. >> And so when you say coupled, is it because it's a trade-off, or 'cause you need both? >> Right, so it's a little bit of both, for a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often, greater efficiencies. Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in, and your method of processing them isn't efficient enough, it gets too slow simply from that, and that's why it's not just one or the other. >> So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency, and is that relevant for the apps that you're building? >> I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have, so that's one explicit one, but part of what's useful is, often when you sit down to try and determine what is the main cause of latency, you have to look at the full profile of a stack of what it's going through, and then you discover other inefficiencies that can be ironed out, and so it just makes it faster overall. I would say, a lot of what the Databricks guys in the Spark community have worked on over the years is connected to that, Project Tungsten and so on, well, what are all these things that make things much slower, much less efficient than they need to be, and we can close that gap a lot, I would say, from the very beginning. >> This brings up something that we were talking about earlier, which is, Matei has talked for a long time about wanting to take end-to-end control of continuous apps, for simplicity and performance, and so there's this, we'll write with transactional consistency, so we're assuring the customer of exactly-once semantics when we write to a file system or database or something like that. But, Spark has never really done native storage, whereas Matei came here on the show earlier today and said, "Well, Databricks as a company "is going to have to do something in that area," and he talks specifically about databases, and he said, he implied that Apache Spark, separate from Databricks, would also have to do more in state management, I don't know if he was saying key value store, but how would that open up a broader class of apps, how would it make your life simpler as a developer? >> Right. Interesting and great question, this is kind of a subject that's near and dear to my own heart, I would say. So part of that, when you take a step back, is about some of the potential promise of what Spark could be, or what they've always wanted it to be, which is a form of a universal computation engine. So there's a lot of value, if you can learn one small skillset, but it can work in a wide variety of use cases, whether it's streaming or at rest or analytics, and plug other things in.
As always, there's a gap in any such system between theory and reality, and how much can you close that gap, but as for storage systems, this is something that, you and I have talked about this before, and I've written about it a fair amount too, Spark is historically an analytic system, so you have a bunch of data, and you can do analytics on it, but where's that data come from? Well, either it's streaming in, or you're reading from files, but most people need, essentially, an actual database. So what constitutes the universal system? You need file store, you need a distributive file store, you need a database with generally transactional semantics because the other forms are too hard for people to understand, you need analytics that are extensible, and you need a way to stream data in, and there's how close can you get to that, versus how much do you have to fit other parts that come together, very interesting question. >> So, so far, they've sort of outsourced that to DIY, do-it-yourself, but if they can find a sufficiently scalable relational database, they can do the sort of analytical queries, and they can sort of maintain state with transactions for some amount of the data flowing through. My impression is that, like Cassandra would be the, sort of the database that would handle all updates, and then some amount of those would be filtered through to a multi-model DBMS. When I say multi-model, I mean handles transactions and analytics. Knowing that you would have the option to drop that out, what applications would you undertake that you couldn't use right now, where the theme was, we're going to take big data apps into production, and then the competition that they show for streaming is of Kafka and Flink, so what does that do to that competitive balance? >> Right, so how many pieces do you need, and how well do they fit together is maybe the essence of that question, and people ask that all the time, and one of the limits has been, how mature is each piece, how efficient is it, and do they work together? And if you have to master 5,000 skills and 200 different products, that's a huge impediment to real-world usage. I think we're coalescing around a smaller set of options, so in the, Kafka, for example, has a lot of usage, and it seems to really be, the industry seems to be settling on that is what people are using for inbound streaming data, for ingest, I see that everywhere I go. But what happens when you move from Kafka into Spark, or Spark has to read from a database? This is partly a question of maturity. Relational databases are very hard to get right. The ones that we have have been under development for decades, right? I mean, DB2 has been around for a really long time with very, very smart people working on it, or Oracle, or lots of other databases. So at Bloomberg, we actually developed our own databases for relational databases that were designed for low latency and very high reliability, so we actually just opensourced that a few weeks ago, it's called ComDB2, and the reason we had to do that was the industry solutions at the time, when we started working on that, were inadequate to our needs, but we look at how long that took to develop for these other systems and think, that's really hard for someone else to get right, and so, if you need a database, which everyone does, how can you make that work better with Spark? 
And I think there're a number of very interesting developments that can make that a lot better, short of Spark becoming and integrating a database directly, although there's interesting possibilities with that too. How do you make them work well together, we could talk about for a while, 'cause that's a fascinating question. >> On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps? Maybe, as Matei told us earlier, they can make the APIs easier to use for a database vendor to integrate, but like we've seen Splice Machine and SnappyData do the work, take it upon themselves to take data frames, the core data structure in Spark, and give it transactional semantics. Does that sound promising? >> There're multiple avenues for potential success, and who can use which, in a way, depends on the audience. If you look at things like Cassandra and HBase, they're distributed key value stores that additional things are being built on, so they started as distributed, and they're moving towards more encompassing systems, versus relational databases, which generally started as a single image on a single machine, and are moving towards federation and distribution, and there's been a lot of that with Postgres, for example. One of the questions would be, is it just knobs, or why don't they work well together? And there're a number of reasons. One is, what can be pushed down, how much knowledge do you have to have to make that decision, and optimizing that, I think, is actually one of the really interesting things that could be done, just as we have database query optimizers, why not, can you determine the best way to execute down a chain? In order to do that well, there are two things that you need that haven't yet been widely adopted, but are coming. One is the very efficient copy of data between systems, and Apache Arrow, for example, is very, very interesting, and it's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing, as soon as you hop from one system to another, all of a sudden, you have the serialization computational expense, that's a problem, we can fix that. The other is, the next level of integration requires, basically, exposing more hooks. In order to know, where should a query be executed and which operator should I push down, you need something that I think of as a meta-optimizer, and also, knowledge about the shape of the data, or statistics underlying it, and ways to exchange that back and forth to be able to do it well. >> Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues, things you'd love to see worked on? >> Right. It's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable.
The hardest thing of all, in any organization, is to get people working together, but the more people work together to enable these pieces, how do I efficiently work with databases, or have these better optimizations make streaming more mature, the more people can use it in practice, and that's why people develop software, is to actually tackle these real-world problems, so, I would love to see more of that. >> Can we all get along? (chuckling) Well, that's going to be the last word of this segment, Matt, thank you so much for coming on and spending some time with us here to share the story! >> My pleasure. >> Alright, thank you so much. Thank you George, and thank you all for watching this segment of theCUBE, please stay with us, as Spark Summit 2017 will be back in a few moments.
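As one concrete follow-on to the Arrow point Matt raises above: in later Spark releases (2.3 onward), the columnar-exchange idea surfaces in PySpark as Arrow-accelerated conversion between Spark and pandas. The snippet below is a hedged illustration; the exact flag name has changed across versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-exchange").getOrCreate()

# Introduced around Spark 2.3; renamed to
# spark.sql.execution.arrow.pyspark.enabled in newer releases.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(0, 10_000_000).withColumnRenamed("id", "trade_id")

# With Arrow enabled, this crosses the JVM/Python boundary as columnar
# batches instead of row-by-row serialization -- the copy cost Matt refers to.
pdf = df.toPandas()
print(len(pdf))
```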
Rob Lantz, Novetta - Spark Summit 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco it's the CUBE covering Spark Summit 2017 brought to you by Data Bricks. >> Welcome back to the CUBE, we're continuing to talk to people who are not just talking about things but doing things. We're happy to have, from Novetta, the Director of Predictive Analytics, Mr. Rob Lantz. Rob, welcome to the show. >> Thank you. >> And off to my right, George, how are you? >> Good. >> We've introduced you before. >> Yes. >> Well let's talk to the guest. Let's get right to it. I want to talk to you a little bit about what does Novetta do and then maybe what apps you're building using Spark. >> Sure, so Novetta is an advanced analytics company, we're medium sized and we develop custom hardware and software solutions for our customers who are looking to get insights out of their big data. Our primary offering is a hard entity resolution engine. We scale up to billions of records and we've done that for about 15 years. >> So you're in the business end of analytics, right? >> Yeah, I think so. >> Alright, so talk to us a little bit more about entity resolution, and that's all Spark right? This is your main priority? >> Yes, yes, indeed. Entity resolution is the science of taking multiple disparate data sets, traditional big data, and taking records from those and determining which of those are actually the same individual or company or address or location and which of those should be kept separate. We can aggregate those things together and build profiles, and that enables a more robust picture of what's going on for an organization. >> Okay, and George? >> So what did you do... What was the solution looking like before Spark and how did it change once you adopted Spark? >> Sure, so with Spark, it enabled us to get a lot faster. Obviously those computations scaled a lot better. Before, we were having to write a lot of custom code to get those computations out across a grid. When we moved to Hadoop and then Spark, that made us, let's say, able to scale those things and get it done overnight or in hours and not weeks. >> So when you say you had to do a lot of custom code to distribute across the cluster, does that include when you were working with MapReduce, or was this even before the Hadoop era? >> Oh it was before the Hadoop era and that predates my time so I won't be able to speak expertly about it, but to my understanding, it was a challenge for sure. >> Okay so this sounds like a service that your customers would then themselves build on. Maybe an ETL customer would figure out master data from a repository that is not as carefully curated as the data warehouse or similar applications. So who is your end customer and how do they build on your solution? >> Sure, so the end customer typically is an enterprise that has large volumes of data that deal in particular things. They collect, it could be customers, it could be passengers, it could be lots of different things. They want to be able to build profiles about those people or companies, like I said, or locations, any number of things can be considered an entity. The way they build upon it then is how they go about quantifying those profiles. We can help them do that, in fact, some of the work that I manage does that, but often times they do it themselves. They take the resolved data, and that gets resolved nightly or even hourly, and they build those profiles themselves for their own purpose.
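For readers who want a feel for what entity resolution looks like on Spark, here is a toy sketch of the blocking-plus-fuzzy-matching pattern; it is not Novetta's engine, and the source paths, column names, and 0.85 threshold are all invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("er-sketch").getOrCreate()

# Two hypothetical sources with name/zip/record_id columns.
a = spark.read.parquet("s3a://example-bucket/source_a/").alias("a")
b = spark.read.parquet("s3a://example-bucket/source_b/").alias("b")

# Blocking: only compare records that share a cheap key (here, zip code),
# so we avoid the full cross product across billions of records.
candidates = a.join(b, F.col("a.zip") == F.col("b.zip"))

# Fuzzy match on lower-cased names via edit distance.
scored = candidates.withColumn(
    "name_sim",
    1 - F.levenshtein(F.lower(F.col("a.name")), F.lower(F.col("b.name"))) /
        F.greatest(F.length(F.col("a.name")), F.length(F.col("b.name"))))

matches = (scored.where(F.col("name_sim") > 0.85)
           .select(F.col("a.record_id").alias("id_a"),
                   F.col("b.record_id").alias("id_b"),
                   "name_sim"))
matches.show()
```

A production engine layers many more comparators, weighting rules, and clustering of matched pairs on top of this basic shape.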
Then, to help us think about the application or the use case holistically, once they've built those profiles and essentially harmonized the data, what does that typically feed into? >> Oh gosh, any number of things really. Oh, shoot. We've got deployments in AWS in the cloud, we've got deployments, lots of deployments on premises obviously. That can go anywhere from relational databases to graph query language databases. Lots of different places from there for sure. >> Okay so, this actually sounds like, everyone talks now about machine learning informing every category of software. This sounds like you take the old style ETL, where master data was a value-add layer on top, and that was, it took a fair amount of human judgment to do. Now, you're putting that service on top of ETL and you're largely automating it, probably with, I assume, some supervised guidance, supervised training. >> Yes, so we're getting into the machine learning space as far as entity extraction and resolution and recognition because more and more data is unstructured. But machine learning isn't necessarily a baked-in part of that. Actually entity resolution is a prerequisite, I think, for quality machine learning. So if Rob Lantz is a customer, I want to be able to know what has Rob Lantz bought in the past from me. And maybe what is Rob Lantz talking about in social media? Well I need to know how to figure out who those people are, and who's Rob Lantz, and whether Robert Lantz is a completely different person, I don't want to collapse those two things together. Then I would build machine learning on top of that to say, right, now what's his behavior going to be in the future. But once I have that robust profile built up, I can derive a lot more interesting features with which to apply the machine learning. >> Okay, so you are a Data Bricks customer and there's also a burgeoning partnership. >> Rob: Yeah, I think that's true. >> So talk to us a little bit about what are some of the frustrations you had before adopting Data Bricks and maybe why you chose it. >> Yeah, sure. So the frustrations primarily with a traditional Hadoop environment involved having to go from one customer site to another customer site with an incredibly complex technology stack and then do a lot of the cluster management for those customers even after they'd already set it up because of all the inner workings of Hadoop and that ecosystem. Getting our Spark application installed there, we had to penetrate layers and layers of configuration in order to tune it appropriately to get the performance we needed. >> David: Okay, and were you at the keynote this morning? >> I was not, actually. >> Okay, I'm not going to ask you about that then. >> Ah. >> But I am going to ask you a little bit about your wishlist. You've been talking to people maybe in the hallway here, you just got here today but, what do you wish the community would do or develop, what would you like to learn while you're here? >> Learning while I'm here, I've already picked up a lot. So much going on and it's such a fast paced environment, it's really exciting. I think if I had a wishlist, I would want a more robust MLlib, machine learning library. All the things that you can get in traditional scientific computing stacks moved onto Spark MLlib for easier access, on a cluster, would be great. >> I thought several years ago MLlib took over from Mahout as the most active open source community for adding, really, I thought, scale-out machine learning algorithms.
If it doesn't have it all now, or maybe all is something you never reach, kind of like Red Queen effect, you know? >> Rob: For sure, for sure. >> What else is attracting these scale out implementations of the machine learning algorithms? >> Um? >> In other words, what are the platforms? If it's not Spark then... >> I don't think it exists frankly, unless you write your own. I think that would be the way to go. That's the way to go about it now. I think what organizations are having to do with machine learning in a distributed environment is just go with good enough, right. Whereas maybe some of the ensemble methods that are, actually aren't even really cutting edge necessarily, but you can really do a lot of tuning on those things, doing that tuning distributed at scale would be really powerful. I read somewhere, and I'm not going to be able to quote exactly where it was but, actually throwing more data at a problem is more valuable than tuning a perfect algorithm frankly. If we could combine the two, I think that would be really powerful. That is, finding the right algorithm and throwing all the data at it would get you a really solid model that would pick up on that signal that underlies any of these phenomena. >> David: Okay well, go ahead George. >> I was going to ask, I think that goes back to, I don't know if it was Google Paper, or one of the Google search quality guys who's a luminary in the machine learning space says, "data always trumps algorithms." >> I believe that's true and that's true in my experience certainly. >> Once you had this machine learning and once you've perhaps simplified the multi-vendor stack, then what is your solution start looking like in terms of broadening its appeal, because of the lower TCO. And then, perhaps embracing more use cases. >> I don't know that it necessarily embraces more use cases because entity resolution applies so broadly already, but what I would say is will give us more time to focus on improving the ER itself. That's I think going to be a really, really powerful improvement we can make to Novetta entity analytics as it stands right now. That's going to go into, we alluded to before, the machine learning as part of the entity resolution. Entity extraction, automated entity extraction from unstructured information and not just unstructured text but unstructured images and video. Could be a really powerful thing. Taking in stuff that isn't tagged and pulling the entities out of that automatically without actually having to have a human in the loop. Pulling every name out, every phone number out, every address out. Go ahead, sorry. >> This goes back to a couple conversations we've had today where people say data trumps algorithms, even if they don't say it explicitly, so the cloud vendors who are sitting on billions of photos, many of which might have house street addresses and things like that, or faces, how do you make better... How do you extract better tuning for your algorithms from data sets that I assume are smaller than the cloud vendors? >> They're pretty big. We employ data engineers that are very experienced at tagging that stuff manually. What I would envision would happen is we would apply somebody for a week or two weeks, to go in and tag the data as appropriate. In fact, we have products that go in and do concept tagging already across multiple languages. That's going to be the subject of my talk tomorrow as a matter of fact. 
But we can tag things manually or with machine assistance and then use that as a training set to go apply to the much larger data set. I'm not so worried about the scale of the data, we already have a lot, a lot of data. I think it's going to be getting that proof set that's already tagged. >> So what you're saying is, it actually sounds kind of important. That actually almost ties into what we hear about Facebook training their messenger bot where we can't do it purely just on training data so we're going to take some data that needs semi-supervision, and that becomes our new labeled set, our new training data. Then we can run it against this broad, unwashed mass of training data. Is that the strategy? >> Certainly we would get there. We would want to get there and that's the beauty of what Data Bricks promises, is that ability to save a lot of the time that we would spend doing the nug work on cluster management to innovate in that way and we're really excited about that. >> Alright, we've got just a minute to go here before the break, so I wanted to ask you maybe, the wish list question, I've been asking everybody today, what do you wish you had? Whether it's in entity resolution or some other area in the next couple of years for Novetta, what's on your list? >> Well I think that would be the more robust machine learning library, all in Spark, kind of native, so we wouldn't have to deploy that ourselves. Then, I think everything else is there, frankly. We are very excited about the platform and the stack that comes with it. >> Well that's a great ending right there, George do you have any other questions you want to ask? Alright, we're just wrapping up here. Thank you so much, we appreciate you being on the show Rob, and we'll see you out there in the Expo. >> I appreciate it, thank you. >> Alright, thanks so much. >> George: It's good to meet you. >> Thanks. >> Alright, you are watching the CUBE here at Spark Summit 2017, stay tuned, we'll be back with our next guest.
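As a sketch of the "tune at scale and throw more data at it" combination Rob describes wishing for, Spark ML's cross-validation already distributes a basic version of this; the model choice, feature columns, and grid values below are illustrative, not Novetta's pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# Hypothetical labeled training set (e.g., hand-tagged match / non-match pairs).
train = spark.read.parquet("s3a://example-bucket/labeled_pairs/")

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features"),
    rf,
])

# The grid search fans out across the cluster: ensemble tuning at scale.
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100, 200])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
best_model = cv.fit(train).bestModel
```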
Dr. Jisheng Wang, Hewlett Packard Enterprise, Spark Summit 2017 - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017 brought to you by Databricks. >> You are watching theCUBE at Spark Summit 2017. We continue our coverage here talking with developers, partners, customers, all things Spark, and today we're honored now to have our next guest Dr. Jisheng Wang who's the Senior Director of Data Science at the CTO Office at Hewlett Packard Enterprise. Dr. Wang, welcome to the show. >> Yeah, thanks for having me here. >> All right and also to my right we have Mr. Jim Kobielus who's the Lead Analyst for Data Science at Wikibon. Welcome, Jim. >> Great to be here like always. >> Well let's jump into it. At first I want to ask about your background a little bit. We were talking about the organization, maybe you could do a better job (laughs) of telling me where you came from, and you just recently joined HPE. >> Yes. I actually recently joined HPE earlier this year through the Niara acquisition, and now I'm the Senior Director of Data Science in the CTO Office of Aruba. Actually, Aruba, you probably know, like two years back HP acquired Aruba as a wireless networking company, and now Aruba takes charge of the whole enterprise networking business in HP, which is over about three billion in annual revenue every year now. >> Host: That's not confusing at all. I can follow you (laughs). >> Yes, okay. >> Well all I know is you're doing some exciting stuff with Spark, so maybe tell us about this new solution that you're developing. >> Yes, actually most of my experience of Spark goes back to the Niara time, so Niara was a three and a half year old startup that invented, reinvented enterprise security using big data and data science. So the problem we tried to solve in Niara is called UEBA, user and entity behavioral analytics. So I'll just try to be very brief here. Most of the traditional security solutions focus on detecting attackers from outside, but what if the origin of the attacker is inside the enterprise, say Snowden, what can you do? So you probably heard of many cases today of employees leaving the company and stealing lots of the company's IP and sensitive data. So UEBA is a new solution trying to monitor the behavioral change of the enterprise users to detect both this kind of malicious insider and also the compromised user. >> Host: Behavioral analytics. >> Yes, so it sounds like it's a native analytics which we run like a product. >> Yeah and Jim you've done a lot of work in the industry on this, so any questions you might have for him around UEBA? >> Yeah, give us a sense for how you're incorporating streaming analytics and machine learning into that UEBA solution and then where Spark fits into the overall approach that you take? >> Right, okay. So actually when we started three and a half years back, when we developed the first version of the data pipeline, we used a mix of Hadoop, YARN, Spark, even Apache Storm for different kinds of stream and batch analytics work. But soon after, with the increased maturity and also the momentum from this open source Apache Spark community, we migrated all our stream and batch, you know the ETL and data analytics work, into Spark. And it's not just Spark. It's Spark, Spark Streaming, MLlib, the whole ecosystem of that. So there are at least a couple of advantages we have experienced through this kind of transition. The first thing which really helped us is the simplification of the infrastructure and also the reduction of the DevOps effort there.
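A minimal structured-streaming sketch of the kind of unified stream-plus-batch pipeline Dr. Wang describes (ingest, ETL, then periodic aggregations) might look like the following. The Kafka topic, field names, and window sizes are assumptions rather than Niara's actual pipeline, and the Kafka source needs the spark-sql-kafka package on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ueba-pipeline-sketch").getOrCreate()

# Streaming ingest; broker and topic names are placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "network-events")
       .load())

# Minimal ETL: parse the JSON payload into the fields the analytics need.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(
              F.get_json_object("json", "$.user").alias("user"),
              F.get_json_object("json", "$.bytes_out").cast("long").alias("bytes_out"),
              F.get_json_object("json", "$.ts").cast("timestamp").alias("ts")))

# One of several windowed jobs (1 min, 10 min, hourly, daily) on the stream.
per_user = (events
            .withWatermark("ts", "10 minutes")
            .groupBy(F.window("ts", "10 minutes"), "user")
            .agg(F.sum("bytes_out").alias("bytes_out_10m")))

query = (per_user.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/ueba-checkpoint")
         .start())
```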
>> So simplification around Spark, the whole stack of Spark that you mentioned. >> Yes. >> Okay. >> So for the Niara solution originally, we supported, even here today, we supported both the on-premise and the cloud deployment. For the cloud we also supported the public clouds like AWS, Microsoft Azure, and also private cloud. So you can understand, if we have to maintain a stack of different open source tools over this kind of many different deployments, the overhead of doing the DevOps work to monitor, alarm, and debug this kind of infrastructure over different deployments is very hard. So Spark provides us a unified platform. We can integrate the streaming, you know batch, real-time, near real-time, or even long-term batch jobs all together. So that heavily reduced both the expertise and also the effort required for the DevOps. This is one of the biggest advantages we experienced, and certainly we also experienced things like the scalability, performance, and also the convenience for developers to develop new applications, all of this, from Spark. >> So are you using the Spark structured streaming runtime inside of your application? Is that true? >> We actually use Spark in the streaming processing when the data comes in, so like in the UEBA solution, the first thing is collecting a lot of the data from different kinds of data sources, network data, cloud application data. So when the data comes in, the first thing is a streaming job for the ETL, to process the data. Then after that, we actually also develop different frequencies, like one minute, 10 minute, one hour, one day, of analytics jobs on top of that. And even recently we have started some early adoption of deep learning into this, how to use deep learning to monitor the user behavior change over time, especially after a user gives notice: is the user going to access more servers or download some of the sensitive data? So all of this requires a very complex analytics infrastructure. >> Now there were some announcements today here at Spark Summit by Databricks of adding deep learning support to their core Spark code base. What are your thoughts about the deep learning pipelines API that they announced this morning? It's new news, I'll understand if you don't, haven't digested it totally, but you probably have some good thoughts on the topic. >> Yes, actually this is also news for me, so I can just speak from my current experience. How to integrate deep learning into Spark actually was a big challenge so far for us because for the deep learning piece, we used TensorFlow. And certainly most of our other stream and data massaging or ETL work is done by Spark. So in this case, there are a couple of ways to manage this. One is to set up two separate resource pools, one for Spark, the other one for TensorFlow, but in our deployments there are some very small on-premise deployments which have only like a four node or five node cluster. It's not efficient to split resources in that way. So we are actually also looking for some closer integration between deep learning and Spark. So one thing we looked at before is called TensorFlow on Spark, which was open sourced a couple months ago by Yahoo. >> Right. >> So maybe this is certainly more exciting news for the Spark team to develop this native integration. >> Jim: Very good. >> Okay and we talked about the UEBA solution, but let's go back to a little broader HPE perspective. You have this concept called the intelligent edge, what's that all about?
>> So that's a very cool name. Actually, to step back a little bit, I come from an enterprise background, and enterprise applications actually lag behind consumer applications in terms of the adoption of new data science technology. There are some native challenges there. For example, collecting and storing large amounts of this sensitive enterprise data is a huge concern, especially in European countries. Also, for a similar reason, when you develop enterprise applications you normally lack a good quantity and quality of training data. So these are some native challenges when you develop enterprise applications, but despite this, HPE and Aruba recently made several acquisitions of analytics companies to accelerate the adoption of analytics into different product lines. Actually, the intelligent edge comes from IoT, the internet of things, which is expected to be the fastest growing market in the next few years. >> So are you going to be integrating the UEBA behavioral analytics and Spark capability into your IoT portfolio at HP? Is that a strategy or direction for you? >> Yes. Yes, for the big picture that certainly is. I think some Gartner reports expect the number of IoT devices to grow to over 20 billion by 2020. Since all of these IoT devices are connected to either an intranet or the internet, either through wire or wireless, as a networking company we have the advantage of collecting data and even taking some actions in the first place. So the idea of the intelligent edge is that we want to turn each of these small IoT devices, like IP cameras or motion detectors, into a distributed sensor for data collection and also an inline actor to make real-time or close to real-time decisions. For example, behavioral anomaly detection is a very good example here. If an IoT device is compromised, if the IP camera has been compromised and then used to steal your internal data, we should detect and stop that in the first place. >> Can you tell me about the challenges of putting deep learning algorithms natively on resource constrained endpoints in the IoT? That must be really challenging, to get them to perform well considering that there may be just a little bit of memory or flash capacity or whatever on the endpoints. Any thoughts about how that can be done effectively and efficiently? >> Very good question. >> And at low cost. >> Yes, very good question. So there are two aspects to this. First is the global training of the intelligence, which is not going to be done on each of the devices. In that case, each device is more like a sensor for data collection. So we are going to collect the data, send it to the cloud, and build a giant pool of computing resources to train the classifier, to train the model. But once we train the model, we are going to ship the model, so the inference and the detection of those behavioral anomalies really happen on the endpoint. >> Do the training centrally and then push the trained algorithms down to the edge devices. >> Yes. But second, as you said, for some of the devices, say when people try to put those small chips in a spoon in a hospital to make it more intelligent, you cannot put even just the detection piece there. So we are also looking at some new technology.
I know Caffe recently released some lightweight deep learning models. Also, as you probably know, there are some improvements coming from the chip industry. >> Jim: Yes. >> How to optimize chip design for these kinds of more analytics-driven tasks. So we are looking at all of these different areas now. >> We have just a couple minutes left, and Jim, you get one last question after this, but I've got to ask you, what's on your wishlist? What do you wish you could learn, or maybe what did you come to Spark Summit hoping to take away? >> I've always considered myself a technical developer. One thing I am very excited about these days is the emergence of new technologies like Spark, TensorFlow, Caffe, even BigDL, which was announced this morning. So that's the first goal: when I come to these big industry events, I want to learn the new technology. And the second thing is to share our experience adopting these new technologies, and also to learn from colleagues in different industries how people change lives and disrupt old industries by taking advantage of the new technologies here. >> The community's growing fast. I'm sure you're going to find what you're looking for. And Jim, final question? >> Yeah, I heard you mention DevOps and Spark in the same context, and that's a huge theme we're seeing: more DevOps being wrapped around the lifecycle of development and training and deployment of machine learning models. If you could have your ideal DevOps tool for Spark developers, what would it look like? What would it do, in a nutshell? >> Actually, I'll just share my personal experience. In Niara, we developed a lot of in-house DevOps tools. For example, when you run a lot of different Spark jobs, stream and batch, like a one minute batch versus a one day batch job, how do you monitor the status of those workflows? How do you know when the data stops coming? How do you know when a workflow failed? Monitoring is a big thing, and then alarming: when you have a failure or something wrong, how do you alert on it? And debugging is another big challenge. So I certainly see the growing effort from both Databricks and the community on different aspects of that. >> Jim: Very good. >> All right, so I'm going to ask you for kind of a soundbite summary. I'm going to put you on the spot here: you're in an elevator and I want you to answer this one question. Spark has enabled me to do blank better than ever before. >> Certainly, certainly. I think as I explained before, it has helped a lot, both for developers and for start-ups trying to disrupt an industry. It helps a lot, and I'm really excited to see this deep learning integration and all the different roadmap items down the road. I think they're on the right track. >> All right. Dr. Wang, thank you so much for spending some time with us. We appreciate it, and go enjoy the rest of your day. >> Yeah, thanks for having me here. >> And thank you for watching theCUBE. We're here at Spark Summit 2017. We'll be back after the break with another guest. (easygoing electronic music)
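The centralized-training, edge-inference split described in this interview can be sketched with plain Spark MLlib APIs. The sketch is illustrative only: the feature names, label, and paths are invented, the real Niara/Aruba models are not public, and their deep learning piece used TensorFlow rather than MLlib.

```python
# Illustrative sketch of "train centrally, infer at the edge". All names and paths
# are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("central-training-sketch").getOrCreate()

# Hypothetical feature table; "is_anomalous" is assumed to be a 0/1 double label.
train = spark.read.parquet("/data/user_behavior_features")

assembler = VectorAssembler(
    inputCols=["bytes_out", "servers_accessed", "after_hours_logins"],  # made up
    outputCol="features")
clf = LogisticRegression(labelCol="is_anomalous", featuresCol="features")

model = Pipeline(stages=[assembler, clf]).fit(train)   # heavy compute stays central

# Persist the fitted pipeline. A real edge deployment would typically export the
# learned weights to a lighter runtime rather than run Spark on the device itself.
model.write().overwrite().save("/models/behavior_v1")
```

In this pattern the expensive fit() step stays on the central cluster, while only the small trained artifact is shipped toward the devices for scoring.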
Octavian Tanase, Netapp - #SparkSummit - #theCUBE
(upbeat music) >> Announcer: Live from San Francisco, it's theCUBE covering Spark Summit 2017. Brought to you by Databricks. >> You are watching theCUBE at Spark Summit 2017. I'm David Goad, here with my friend George Gilbert. How you doing, George? >> Good. >> All right, but the man of the hour is over to my left. I'd like to introduce a Databricks partner, and his name is Octavian Tanase. He's the SVP for the Data ONTAP Software and Systems Group at NetApp. Octavian. >> Thank you for having us. >> All right, well you have kind of an interesting background. We were chatting before, you started as an engineer, developer? >> Yeah, so I'm in an executive role right now, but I have an interesting trajectory. Most people in a similar role come from a product management or sales background. I'm a former engineer and, you know, somebody that has a passion for technology and now for customers and building interesting technologies. >> Okay, well if you have a passion for this technology then, I'd like to get your take on the marketplace a little bit. Tell us about the evolution of the mainstream and what you see changing. >> I think data is the new currency of the 21st century. You have a desire and a thirst to get more out of your data. You have developers, you have analysts looking to build the next great application, to mine your data for great business outcomes. NetApp, as a data management company, is very much interested in working with companies like Databricks and a bunch of hyperscalers to enable the types of solutions that support in-place analytics or data lakes, you know, solutions that really enable developers and analysts to harness the power of that data. >> Mhmm. So ... Maybe walk us through what you've seen to date in terms of the mainstream use cases for big data, and then tell us where you think they're going, and what walls need to be pushed back with the combination of technologies to get there. >> Originally I saw a lot of people investing in data lake technologies. Data lakes in a nutshell are massive containers that are simple to manage, scalable, and performant, where you can aggregate a bunch of data sources and then run a MapReduce type of workload to correlate that data, to harness the power of that data, to draw conclusions. That was sort of the original track. Over time, I think there's a desire, given how dynamic and diverse the data is, to do a lot of this analytics in-line, in real time. That's where companies like Databricks come in, and that's where the cloud comes in, to enable both the agility and the kind of real-time behavior needed to get those analytics. >> Now, this is your first Spark Summit? >> Absolutely, happy to be here. >> Oh, I know it's just the first day, but what have you learned so far? Any great questions from other participants? >> Well, I think I see a lot of people innovating very fast. I see both established players paying attention, and I see new companies looking to take advantage of this revolution that is happening, you know, around data and data services and data analytics. >> Maybe tell us a little more about what we were talking about before we started: how some customers who are very sensitive about their data want to keep it in their data centers, or in Equinix, which still counts as pretty much theirs, but the compute is often in the cloud somewhere.
>> As you can imagine, we work with a lot of enterprise customers, and one thing that I've learned in the last couple of years is that their thought process has evolved, you know, banks, large financial institutions. Two years ago they were not even considering the cloud. And I see that now changing: I see them wanting to operate like a cloud provider, I see them wanting to take advantage of the flexibility and the agility of the cloud. I see them being more comfortable with the type of security capabilities that the cloud offers today. Security has probably been the most troublesome issue that folks have had to overcome, and then there's the gravity of the data. The reality is that the data is very distributed and dynamic, diverse in nature as I mentioned earlier. There's data created at the edge, data created in the data center, and people want to be able to process that data in real time regardless of where the data is, without necessarily having to move it in some cases. Everybody's looking for data management solutions that enable mobility, you know, governance, management of that data, and that enable analytics wherever that data is. >> You said some really interesting things in there. I mean, I can see where the customer's data center extends to Equinix, where they want to bring the compute to the data because the data's heavier than the compute, but what about on the edge? Does it make sense to bring, is there enough data there to keep it there and bring compute down to the edge, or do you co-locate compute persistently? And then how much of the compute is done at the edge? >> The reality is that you're probably going to see customers do both. There is more data created at the edge than ever before. You'll see a lot of the data management companies invest in software-defined solutions that require a very small footprint, both from the storage point of view as well as compute. One of the advantages of a technology like ONTAP is the investment that has been made in data reduction, because your ability to store data at the edge is not really very good, so you want to have these capabilities to reduce the footprint by compressing, by deduping, by compacting that data, and then making some smart decisions at the edge. Perhaps do some in-line, in-place analytics there and move some of the data back into a central data center where more batch analytics can take place. >> But when you talk about that compaction, deduping, there was one more, but I think everyone gets the point. Are you talking about having a NetApp ONTAP device near the edge or on the edge? >> That device is actually software only. >> Ahh. >> You guys are probably aware of the fact that ONTAP now ships in three flavors, or three form factors. There is an engineered appliance, and we will likely do that for many years to come. But we also have ONTAP running in a virtual environment, either on KVM or VMware, as well as ONTAP running in the cloud. We've been running in the AWS cloud since 2014. We're also running in the Azure cloud. We are talking to other vendors to improve the ubiquity of software-defined ONTAP. >> Just to be really specific, we're told now that an edge gateway, not an edge device, but gateway, has about two gigs of memory and two cores. Is that something a software-defined ONTAP would run on? >> Absolutely. You'll see us running on a variety of devices in the field with energy companies.
You'll see ONTAP running in the tactical sphere, and we have projects that I can't really tell you about, but you'll find it broadly deployed on the edge. >> George: Okay. >> Yeah, let's talk a little bit about NetApp. What are some of the business outcomes you're looking for here? Do you have good executive sponsorship of these initiatives? >> We are very excited to be here. NetApp has been in the data management realm for a very, very long time. Analytics is a natural place, a great adjacency for us. We've been very fortunate to work with NoSQL types of companies. We've been very happy to collaborate with some of the leaders in analytics such as Databricks. We are entering the IoT space and enabling solutions that are really edge focused. So overall, this is a great fit for us and we're very excited to participate at the Summit. >> What do you think will be ... We've heard from Matei that sort of the state of the art, in terms of, I hate to say the word, fantasy, but experimentation perhaps, is structured streaming, so continuous apps which are calling on deep learning models. Where would you play in that, and what do you think ... What are the barriers there? What comes next? >> I think any complete analytics solution will need a bunch of services and some infrastructure that lends itself to that type of workload, that type of use case. So you need, in some cases, very fast storage with super low latencies. In some cases you will need tremendous throughput. In some cases you will need that small footprint of an operating system running at the edge to enable some of that in-line processing. I think the market will evolve very fast. The solutions will evolve very fast, and you will need the kind of industry sponsorship from companies that really understand data management and that have made it their business for a very, very long time. I see that synergy being created between the innovation in analytics, the innovation that happens in the cloud, and the innovation that a company like NetApp does around a data fabric and around the types of services that are required to govern, to move, to secure, to protect that data in a very cost-efficient way. >> This is kind of key, because people are struggling with having some sort of commonality in their architecture between the edge, on-prem, and the cloud, but it could be at many different levels. What's your sweet spot for offering that? I mean, you talked about deduping and ... >> Compression and compaction. >> Compression and snapshots or whatever. Having that available in different form factors, what does that enable a customer to do, perhaps using different software on top? >> I'm glad that you asked. The reality is that we want to enable customers to consolidate both second and third platform applications on the ONTAP operating system. Customers will find not only flexibility, but consistency in data management regardless of where the data is, whether it's in the cloud, near the cloud, or on the edge. We believe that we have the most flexible solution to enable data analytics and data management, one that lends itself to all these use cases and enables next-generation applications. >> Okay, but is that predicated on having not just Data ONTAP, but also a common application architecture on top? >> I think we want to enable a variety of solutions to be based there. In some cases we're building glue. What do I mean by glue?
It's, for example, an NFS to HDFS connector that enables that translation from the native format for most of the data in a Hadoop or Spark type of EMR system. We're investing in enabling that flexibility and enabling the innovation that is happening at many of the companies we see here on the floor today. >> George: Okay, that makes sense. >> We have just a minute to go here before the break. If you could talk to the entire Spark community, and you are right now on theCUBE, what's on your wish list? What do you wish people would do more of? Or if you could get help with something, what would it be? >> I think my ask is to continue to innovate. Push boundaries, and continue to be clever in partnering, both with small vendors that are really innovating at a tremendous pace, as well as with established vendors that have made data management their business for many years and are looking to participate in the ecosystem. >> Let's innovate together. >> All right, very good. >> Octavian, thank you so much for taking some time here out of your busy day to share with theCUBE. We appreciate you being here. >> Very good. >> Thank you so much. >> Pleasure. >> Thanks, Octavian. >> That's right, you're watching theCUBE here at Spark Summit 2017. We'll see you in a few minutes with our next guest. (upbeat electronic music)
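One way to picture the "glue" and data fabric idea discussed in this interview is that Spark's data source API keeps the analytics code identical whether the dataset sits on an NFS export, in HDFS, or in an object store. The sketch below is illustrative only: the paths, dataset, and column name are hypothetical, and this is stock Spark rather than any NetApp connector.

```python
# Illustrative only: the same Spark analytics code reading the same (hypothetical)
# dataset from three different storage back ends.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-abstraction-sketch").getOrCreate()

paths = [
    "file:///mnt/nfs_export/sensor_data",    # NFS share mounted on the cluster nodes
    "hdfs://namenode:8020/sensor_data",      # HDFS
    "s3a://some-bucket/sensor_data",         # object store (needs the hadoop-aws jar)
]

for path in paths:
    df = spark.read.parquet(path)
    df.groupBy("device_id").count().show(5)  # identical analytics regardless of source
```

The design point is that moving or tiering the data across storage back ends does not force a rewrite of the analytics job, only a change of path or mount.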
Matei Zaharia, Databricks - #SparkSummit - #theCUBE
>> Narrator: Live from San Francisco, it's theCUBE. Covering Spark Summit 2017, brought to you by Databricks. (upbeat music) >> Welcome back to Spark Summit 2017. You're watching theCUBE, and we have an honored guest here today. His name is Matei Zaharia, and Matei is the creator of Spark, Chief Technologist, and Co-Founder of Databricks. Did I get all that right? >> Yeah, thanks a lot for having me again. Excited to be here. >> Yeah, Matei, we were watching your keynote this morning, and we're all excited to hear about better support for deep learning, and about some of the structured streaming apps now being in production. I want to ask you, what happened after the keynote? What kind of feedback have you heard from people in the hallways? >> Yeah definitely, so the feedback has definitely been super positive. I think people really like the direction that we're moving in with Apache Spark and with these libraries, such as the deep learning pipelines one. So we've gotten a lot of questions about the deep learning library, when it will support more types and so on. It's really good at supporting images right now. And also with streaming, I think people are just excited to try out the low latency streaming. >> Any other priorities people asked you about that maybe you haven't focused on yet? >> That I haven't focused on in the keynote, so I think that's a good question. I think overall some of the things we keep seeing are that people just want to make it easier to operate Spark at scale and simplify things like monitoring and debugging and so on, so that's a constant theme that we're seeing. And then another thing that's generally been going on, I didn't focus on it this time, is increasing usage by Python and R users. So there's a lot of work in the latest release to continue improving that, to make it easier to use in those languages. >> Okay, we were watching the demos, the impressive demos, this morning. In fact, George was watching the keynote, and at the one millisecond latency he said wow. George, you want to ask a little more about that? >> So yeah, let's talk about, 'cause there's this rise of continuous apps, which I think you guys named. >> Matei: Yeah. >> And it resonates with everyone, to go along with batch and request-response. And in the past, people were saying, well, Spark was doing many micro-batches, and latency was a couple hundred milliseconds. So now that you're down at one millisecond, what does that change in terms of the class of apps that you're appropriate for? You know, some people have talked about the criticality of event processing. Where is Spark on that now? >> Yeah definitely, so the goal of this is exactly to support the full range of latencies possible, all the way down to sub-millisecond latency. And to give users the same programming model, so they don't have to use a different system or a lower level programming model to get that low latency. And so basically, since we began structured streaming, we tried to make sure the API is not tied in with micro-batching in any way. And so this is the next step, to actually eliminate that from the engine and be able to execute these computations. And what are the new applications? So I think this really enables two types of things we've seen.
One is kind of automated decision-making systems, so this could be on, say, a website, or say when someone's applying for a loan or something like that, it could be making decisions, but it could even be at even lower latency, like say a stock-market style of place, or internet of things, or industrial monitoring, and making decisions there. That's one thing. And then the other thing we see people doing is a lot of stream-to-stream ETL, which is a bit more boring in some ways, but as you set that up, it's nice to have these very low latency transformations that can produce new streams from an existing one, because then nothing downstream from them is affected in terms of latency. >> So in this last example, it's sort of to help build microservice-type applications. >> Yeah, exactly, yeah. Well, in general, there's basically this whole architecture of saying all my data will be streams, and then I'll have some applications that just produce a new stream. And then later that stuff can go into a data lake or into a real-time system or whatever. So it's basically keeping it low latency while it remains in stream form. >> So we were talking earlier, and we've been talking to the SnappyData folks and the Splice Machine folks. And they built Spark into a DBMS. So that, like, it's immutable. I'm sorry, mutable. >> Matei: Mutable, yeah. >> Like a data frame is updateable. So what does that make possible, even if you can do the same things with Spark, without it? What does it make easier? >> So that's also in the same spirit of continuous applications. It's saying you should have a single programming model and interface for doing both your transactional work and your analytics after, and then maybe serving the results of the analytics. So that makes a lot of sense, and an example of that would be, you know, I keep going back to the financial or credit card type of use cases, but it would be something where users are conducting transactions and maybe you learn stuff about them from that. You say, okay, here's where they're located, here's what they're purchasing, whatever. And then you also have to make a decision. For example, do I allow them to go past the limit on their credit card or something like that? Or is this a normal use of it, or is this a fraudulent one? So that's where it helps to integrate these, and you can do these things. So there are products like SnappyData that integrate a specific database with Spark. And we're also trying to make sure the APIs in Spark let people integrate their own system, whatever database or key-value store they want. >> So would you have to jump through hoops if you didn't want to integrate any other store, other than talking to a file system? >> Yeah, if you want to do these transactions on a file system, there will basically be some performance constraints to doing that. It depends on the rate; it's definitely the simplest thing, and if you have a low enough rate of updates it could actually be fine. But if you want more fine-grained ones, then it becomes a problem. >> It would seem like you could tack on a product for ingest, not that you really want to get into that, think Kafka, which could also stretch into the transforms and some basic analytics. And you mentioned, I think in the Spark Summit East keynote, Redis for serving, so you've now got a sort of multi-vendor product stack. And so there's complexity to that. >> Matei: Yeah, definitely, yeah.
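A stream-to-stream ETL job of the kind described above can be sketched in a few lines of Structured Streaming. This is only an illustration of the pattern: the broker address, topic names, schema, and filtering rule are invented for the example, not taken from any real deployment.

```python
# Sketch of stream-to-stream ETL: read one Kafka topic, derive a new stream,
# publish it to another topic. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_json, struct
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("stream-to-stream-sketch").getOrCreate()

txn_schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("country", StringType()),
])

txns = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
        .select(from_json(col("value").cast("string"), txn_schema).alias("t"))
        .select("t.*"))

# Derived stream: large foreign transactions, re-serialized as JSON for downstream apps.
flagged = (txns.filter((col("amount") > 1000) & (col("country") != "US"))
           .select(to_json(struct("card_id", "amount", "country")).alias("value")))

query = (flagged.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "flagged-transactions")
         .option("checkpointLocation", "/checkpoints/flagged")
         .start())
```

Because the derived topic is itself kept low latency, anything consuming it downstream sees fresh data without caring how it was produced.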
>> Do you foresee a scenario where you could see that as a high-volume solution, and it's something that you would take ownership of? >> I see, so, well, do you mean from the Apache Spark side or from the Databricks side? >> George: Actually, either. >> Yeah, so I think from the Spark side, basically so far the project doesn't provide storage, it just provides computation and it plugs into different storage engines. And so it would be kind of a big shift, it might be possible, but it would be kind of a big shift to say, okay, we'll also provide persistent storage. I think the more likely thing that will happen is better and better integrations with the most widely used open source storage systems. So Redis is one. Apache Kafka, there's a lot of work on integrating that better, and so on. From the Databricks side, that is different, because that is a fully managed cloud service, and it definitely makes sense there that you'd have a turnkey solution for that. Right now, for people who want that, we can actually build it, sometimes with other vendors or with services built into Amazon, but that makes a lot of sense. >> And Matei, something I read a press release on, but I didn't hear it in the keynote this morning. I hate to steal thunder from tomorrow, but can you give us a sneak preview on serverless apps? What's that about? >> Yeah, so we actually put out a press release today, and we'll have a full keynote tomorrow morning and also a lot more details on our website. So this is Databricks Serverless. It's basically a serverless platform for running Apache Spark and data science. So, not to steal away too much thunder, but you know, serverless computing is this idea that users can just submit a query or computation, they don't have to configure the hardware at all, and they just get high performance and they get results. And so far it's been very successful with stateless workloads such as SQL, or Amazon Lambda, which is, you know, just functions serving a webpage or something like that. So this is going to be the first offering that actually extends that model to data science and in general to Spark workloads. So you can have machine learning users, you can have these streaming applications, all these things, on that kind of environment. So yeah, we'll have a lot more detail on that tomorrow, it's something that we're excited about. >> I want to circle back to IoT apps. You know, there's something beyond an emerging consensus that we're going to do a lot of training in the cloud, 'cause we have access to big compute and lots of data. But then the issue on the edge, in the near to medium term, is the footprint. A lot of people are telling us high-volume devices will have 3 megs of memory, and a gateway server would have like two gigs and two cores. So can you carve Spark up into fitting on one of the... >> That's a good question. I think for that, again, the most likely way that would happen is through data sources. For example, there are these projects like Apache NiFi and other projects as well that let you build up a data pipeline from IoT devices all the way to the cloud. And you can imagine some computation through those. So, yeah, I don't have a very concrete answer, but I think it is something that's coming up a bunch, so we do want to support this type of splitting of the computation.
>> But in terms of splitting the computation, you could take a trained model, model training is fat compute, and then the trained model... >> You can definitely push the model and do inference. >> Would that inference have to happen in a Spark runtime, or could it be somewhere else? >> I think it could happen somewhere else also. And actually, we do see a lot of people wanting to export machine learning pipelines or models from Spark into another environment. So it can happen somewhere else too. Yeah, and then the other aspect of it is also data collection. So if you can push something that says here is when the data is interesting, you should remember these and send them on, that would also help, because otherwise, you know, say it's a video camera or something, most of the time it's looking at nothing. I mean, you don't want to send all that back. >> That's actually a key point. Some folks, especially in the IT ops area, which is, you know, training wheels for IoT because they're doing machine learning on infrastructure. >> Matei: Yeah, which is there. >> Yeah, they say, oh, anything outside two standard deviations of the band of expectations, but there's more of an answer to that, I gather, from what you're saying. >> Yeah, I mean, I think you can create, for example, a small machine learning model that decides whether what it's seeing is unusual and sends it back, or you can even make it query specific, like, say, I want to count this type of object that's going by the camera, and try to find that. So I think there's a lot of room to improve that. >> Okay, well, we have just a couple of minutes left here, and I want to look into the future a little bit. There's been some great progress since the summit last year to this one. What would you say is the next boundary that needs to be pushed to get Spark to the next level, whatever that may be? >> Yeah, definitely. Well, okay, so first of all, in terms of the project today, I think the big workloads that we are seeing come up all the time are deep learning and stream processing. These are the big emerging ones. I mean, there's still a lot of data warehousing, ETL and so on, that's still there. But these are the new ones, so that's what we're focusing on on our team at least. And we'll continue building out the stuff that you saw announced today. I think beyond that, part of the problem, and this is more on the Databricks side, part of the problem is also just making it much easier for teams or businesses to begin using these technologies at all. And that's where we think cloud computing, or software as a service, is the way, because you just turn it on and you can immediately start doing things. The way that I view it is, right now the barrier to doing any project with data science or machine learning, or even simple kinds of analytics on unstructured data, is really high. So companies can only do it on a few projects. There might be like a hundred things they could be trying, but they can only afford to spin up two or three of them. So if you lower that barrier, there'll be a lot more of them, and everyone will be able to quickly try one of these applications and see whether it actually works. >> And this ties into some of your graduate studies, like with model management and things like that? >> Yeah, so on the research side.
So I'm also, you know, doing research at Stanford, and on that side we have this lab called DAWN, which is about usable machine learning. It's exactly these things: how do you enable an order of magnitude more people to do things with machine learning. So we're also doing the video push-down thing I mentioned, that's one thing we're looking at, and a bunch of other stuff as well. >> Matei, we could talk to you all day, but we don't have all day. We're up against the break here, but I wanted to thank you very much for coming and sharing a few moments here, and I look forward to seeing you in the hallways here at Spark Summit, right? >> Yeah, thanks again for having me. >> Thanks for joining us, and thank you all for watching. Here we are on theCUBE at Spark Summit 2017, thanks for watching. (upbeat music)
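The pattern Matei and George circle around here, training a model offline and then calling it from a continuous application, can be sketched with stock Spark APIs. The sketch below is illustrative only: the model path, broker, topic, and feature names are hypothetical, the saved model is assumed to be a fitted MLlib PipelineModel that expects these three feature columns, and this is plain open source Spark rather than Databricks Serverless.

```python
# Sketch of a continuous application scoring a live stream with a model trained offline.
# All names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("streaming-scoring-sketch").getOrCreate()

model = PipelineModel.load("/models/fraud_v1")          # trained and saved offline

feature_schema = StructType([
    StructField("amount", DoubleType()),
    StructField("merchant_risk", DoubleType()),
    StructField("distance_from_home", DoubleType()),
])

live = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transaction-features")
        .load()
        .select(from_json(col("value").cast("string"), feature_schema).alias("f"))
        .select("f.*"))

# MLlib transformers work on streaming DataFrames, so the same pipeline used for
# batch training scores the live stream here.
scored = model.transform(live)

query = (scored.select("amount", "prediction")
         .writeStream.format("console")
         .start())
```

In practice the console sink would be replaced by a Kafka topic, a key-value store, or a decision service, which is the serving step discussed earlier in the interview.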