Sean Knapp, Ascend.io & Jason Robinson, Steady | AWS Startup Showcase
(upbeat music) >> Hello and welcome to today's session, theCUBE's presentation of the AWS Startup Showcase, New Breakthroughs in DevOps, Data Analytics, Cloud Management Tools, featuring Ascend.io for the data and analytics track. I'm your host, John Furrier with theCUBE. Today, we're proud to be joined by Sean Knapp, CEO and founder of Ascend.io, and Jason Robinson, who's the VP of Data Science and Engineering at Steady. Guys, thanks for coming on, and congratulations, Sean, for the continued success, loved our CUBE conversation, and Jason, nice to meet you. >> Great to meet you. >> Thanks for having us. >> So, the session today is really kind of looking at automating analytics workloads, right? So, and Steady as a customer. Sean, talk about the relationship with the customer Steady. What's the main product, what's the core relationship? >> Yeah, it's a really great question. When we work with a lot of companies like Steady, we're working hand in hand with their data engineering teams to help them onboard onto the Ascend platform, build these really powerful data pipelines fueling their analytics and other workloads, and really helping to ensure that they can be successful at getting more leverage and building faster than ever before. So we tend to partner really closely with each other's teams and really think of them even as extensions of each other's own teams. I watch in Slack oftentimes and our teams just go back and forth. And it's like, as if we were all just part of the same company. >> It's a really exciting time, Jason, great to have you on as a person cutting your teeth into this kind of what I call next gen data as intellectual property. Sean and I chatted in a CUBE conversation previous to this event where every company is a data company, right? And we've heard that cliche. >> Right. >> But it's true, right? It's getting more powerful with the edge.
You're seeing more diverse data, faster data, small, big, large, medium, all kinds of different aspects and patterns. And it's becoming a workflow kind of intellectual property paradigm for companies, not so much-- >> That's right. >> Just the tech. It's not just the database, it's the data itself, data in flight, it's moving around, it's got value. What's your take-- >> Absolutely. >> On this trend? >> Basically, Steady helps our members, and we have a community of members, earn more income. So we want to help them steady their financial lives. And that's all based on data, so we have a web app, you could go to the iOS App Store, you could go to the Google Play Store, you can download the app. And we have a large number of members, 3 million plus, who are actively using this. And we also have a very exciting new product called Income Passport. And this helps 1099 and mixed wage earners verify their income, which is very important for different government benefits. And then third, we help people with emergency cash grants as well as awards. So all of that is built on a bedrock of data, so if you're using our apps, it's all data powered. So what you were mentioning earlier, from pipelines that are running in real time to, yeah, anything that's kind of a small data aggregation, we do everything from small to real-time and large. >> You guys are like a multi-sided marketplace here, you've got it, you're a FinTech app, as well as the future of work and with virtual space-- >> That's right. >> Happening now, this is becoming, actually encapsulates kind of the critical problems that people are trying to solve right now, you've got multiple stakeholders. >> That's right. >> In the data. >> Yes, we absolutely do. So we have our members, but we also, within the company, we have product, we have strategy, we have a growth team, we have operations. So data engineering and data science also work with a data analytics organization. So at Steady we're very much a data company.
And we have a data organization led by our chief data officer, and we have data engineering and data science, which are my teams, but also that business insights and analytics. So a lot of what we're building on the data engineering side is powering those insights and analytics that the business stakeholders use every day to run the organization. >> Sean, I want to get your thoughts on this, because we heard from Emily Freeman in the keynote, premiering her talk, about how this revolution in DevOps means it's not just one persona anymore: I'm a release engineer, I'm this kind of engineer. You're seeing now all engineering, all developers are developers. You have some specialty, but for the most part, the team makeups are changing. We touched on this in our CUBE conversation. The journey of data is not just the data people, the data folks. They're developers too. So the confluence of data science, data management, developing, is changing the team and cultural makeup of companies. Could you share your thoughts on this dynamic and how it impacts customers? >> Absolutely, I think we're finding a similar trend to what we saw a number of years ago, when we talked about how software was eating the world and every company was now becoming a software company. And as a result, we saw this proliferation and expansion of what the software roles inside a company looked like, which pulled through this entire new era of DevOps. We're finding that same pattern now emerging around data, as not only is every company a software company, every company is a data company, and data really is that fuel, that oil that fuels the business. And in doing so, we're finding that, as Jason describes, it's pervasive across the team. It is no longer just one team that is creating some insights and reports around operational analytics, or maybe a team over here doing data science or machine learning. It is expansive.
And I think the really interesting challenges that start to come with this, too, are that so many data teams are over capacity. We did a recent study that highlighted that 96% of data teams are at or over capacity; only 4% had spare capacity. But as a result, the net is being cast even wider to pull in people from even broader and more adjacent domains to all participate in the data future of their organization. >> Yeah, and I think I'd love to get you guys' reaction to this conversation with Andy Jassy, who's now the CEO of Amazon, but when he was the CEO of AWS last year, I talked with him about how the old guard and new guard are thinking around team formations. Obviously team capacity is growing and challenged when you've got the right formula. So that's one thing, right? But what if you don't have the right formula? If you're on the skills gap problem, or the team formation side of it, where maybe two years ago the mandate came down: well, we've got to build a data team, even if it takes two years, if you're not acquisitive. And this is what Andy and I were talking about, the thinking and the mindset of that mission, and being open to discovering and understanding the changes, because if you were deciding what your team was two, three years ago, that might have changed a lot. So team capacity, Sean, to your point, if you got it right, that's a challenge in and of itself, but what if you don't have it right? What do you guys think about this? >> Yeah, I think that's exactly right. Basically trying to look and gaze into the crystal ball and see what's going to happen in a year or two years, even six months, is quite difficult. And if you don't have it right, you do spend a lot of time because of the technical debt that you've amassed. And we certainly spend quite a bit of time with technical debt for things we wanted to build. So, deconvolving that, getting those ETLs to a runnable state, getting performance there, that's what we spend a bit of time on.
And yeah, it's something that's really part of the package. >> What do you guys see as the big challenge on teams? The scaling challenge, okay? Formation is one thing, Sean, but like, okay, getting it right, getting it formed properly and then scaling it, what are the big things you're seeing? >> One of the, I think, overarching management themes in general is that the highest output comes from the highest performing teams: those where the individual with the context and the idea is able to execute as far and as fast and as efficiently as possible, removing a lot of those encumbrances. To put it a slightly different way: if DevOps all basically boiled down to, how do we help more people write more software faster and safely, data ops would be, very similarly, how do we enable more people to do more things with data faster and safely? And to do that, I think the era of these massive multi-year efforts around data are gone, and hopefully in the not too distant future, even these multi-quarter efforts around data are gone, and we get into a much more agile, nimble methodology where smaller initiatives and smaller efforts are possible by more diverse skillsets across the business. And really what we should be doing is leveraging technology and automation to ensure that people are able to be productive and efficient, that we can trust our data, and that systems are automated. And these are problems that technology is good at. And so in many ways, how in the early days Amazon described it as getting people out of the muck of DevOps, I think we're going to do the same thing around getting people out of the muck of the data and get them really focused on the higher level aspects.
Jason, I was just curious, how much has your team grown in the recent year, and how much could've, should've it grown? What's the status, and how has Ascend helped you guys? What's the dynamic there? 'Cause that's their value proposition. So, take us through that. >> Absolutely, so, since the beginning of the year, data engineering has doubled. So, we're a lean team, we certainly use the agile mindset and methodologies, but we have gone from, yeah, we've essentially doubled. So a lot of that is there's just so much to do, and the capacity problem is certainly there. So we also spend a lot of time figuring out exactly what the right tooling is. And I was mentioning the technical debt. So you have those, there's the big O notation of technical debt. And when you're building new things, you're fixing old things. And then you're trying to maintain everything. That scaling starts to hit hard. So even if we continue to double, I mean, we could easily add more data engineers. And a lot of that is, I mean, you know about the hiring cycles, like, a lot of great talent, but it's difficult to make all of those hires. So, we do spend quite a bit of time thinking about exactly what tools data engineering is using day-to-day. And what I mentioned were technologies on the streaming side all the way to the small batch things. But something that starts as a small batch can grow and grow and grow and take, say, 15 hours; it's possible, I've seen it. And getting that back down, and managing that complexity while not overburdening people who probably don't want to spend all their waking hours building ETLs, maintaining ETLs, putting in monitoring, putting in alerting, that I think is quite a challenge. >> It's so funny, because you mentioned 18 hours, you didn't roll your eyes, but you almost did, but people want it yesterday, they want real time, so there's a lot of demand-- >> Yes.
>> On the minds of the business outcome side of it. So, I got to ask you, because this comes up a lot with technical debt, and now we're starting to see that come into the data conversation. And so I'm always curious, is there a different kind of technical debt with data? Because again, data is like software, but it's a little bit more elusive in the sense that it's always changing. So, what kind of technical debt do you see on the data side that's different than, say, the software side? >> Absolutely, now that's a great question. So a lot of thinking about your data and structuring your data and how you want to use that data going into a particular project might be different from what happens after stakeholders have new considerations and new products and new items that need to be built. So thinking about how that, let's say you have a document store, or you have something that you thought was going to be nice and structured, how that can evolve and support those particular products. Unless you take the time and go through and say, well, let's architect it perfectly so that we can handle that, you're going to make trade-offs and choices, and essentially that debt builds up. So you start cutting corners, you start changing your normalization. You start essentially taking those implicit schema that then tend to build into big things, big implicit schema. And then of course, with implicit schema, you're going to have a lot of null values, you're going to have a lot of items to deal with. So, how do you deal with that? And then you also have the opportunity to create keys and values and, oops, do we take out those keys that were slightly misspelled? So, I could go on for hours, but basically the technical debt is certainly there with data. I see a lot of this as just a spectrum of technical debt, because it's all trade-offs that you made to build a product, and the inefficiencies start to hit you.
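As a hypothetical sketch of the implicit-schema problem Jason describes (the records and field names below are invented for illustration), the union of every key ever seen across drifting documents becomes the "big implicit schema," complete with nulls and misspelled near-duplicate keys:

```python
# Hypothetical illustration of implicit schema drift in a document store.
# All records and field names are invented for this sketch.
docs = [
    {"user_id": 1, "income": 1200},               # original shape
    {"user_id": 2, "income": 900, "src": "app"},  # a field added later
    {"user_id": 3, "incme": 750},                 # a misspelled key sneaks in
]

# The union of every key ever seen becomes the "big implicit schema".
schema = sorted({key for doc in docs for key in doc})

# Flattening each document against that schema fills the gaps with nulls.
rows = [{key: doc.get(key) for key in schema} for doc in docs]
null_count = sum(1 for row in rows for value in row.values() if value is None)

print(schema)      # ['incme', 'income', 'src', 'user_id']
print(null_count)  # 5
```

The misspelled `incme` key never goes away on its own; cleaning it up is exactly the "oops, do we take out those keys" decision Jason mentions.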
So, the 15 hour ETL I was mentioning, basically you start with something, and you were building things for stakeholders, and essentially you have so much complex logic within that. So for the transforms that you're doing, if you're thinking of the bronze, silver, gold kind of framework, going from that bronze to a silver, you may have a massive number of transformations or just a few, just to lightly dust it. But you could also go to gold with many more transformations, and managing that, managing the complexity, managing what you're spending for servers day after day after day. That's another real challenge of that technical debt stuff. >> That's a great lead into my next question for Sean: this is the disparate system complexity. Technical debt in software was always, kind of, the belief was, oh yeah, I'll take some technical debt on and work it off once I get visibility into, say, unit economics or some sort of platform or tool feature, and then you work it off as fast as possible. This becomes the art and science of technical debt. Jason, what you're saying is that this can be unwieldy pretty quickly. You got state and you got a lot of different interconnected moving parts. This is a huge issue, Sean, this is where technical debt in the data world is much different architecturally. If you don't get it right, this is a huge, huge issue. Could you illuminate why that is and what you guys are doing to help unify and change some of those conditions? >> Yeah, absolutely. When we think about technical debt, and I'll keep drawing some parallels between DevOps and data ops, 'cause I think there's a tremendous number of similarities in these worlds, we used to always have the saying that "Your tech debt grows linearly across microservices, but exponentially within services."
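The bronze, silver, gold framing Jason describes a moment earlier can be sketched minimally; the data, field names, and transforms below are invented for illustration, not taken from Steady's pipelines:

```python
# Minimal bronze -> silver -> gold sketch; data and fields are invented.
# Bronze: raw records as ingested, warts and all.
bronze = [
    {"member": "a", "amt": "10.5"},
    {"member": "a", "amt": "4.5"},
    {"member": "b", "amt": None},  # a bad record that slipped in
]

# Silver: lightly dusted -- drop bad records, cast strings to numbers.
silver = [
    {"member": rec["member"], "amt": float(rec["amt"])}
    for rec in bronze
    if rec["amt"] is not None
]

# Gold: heavier aggregation, shaped for the analytical consumer.
gold = {}
for rec in silver:
    gold[rec["member"]] = gold.get(rec["member"], 0.0) + rec["amt"]

print(gold)  # {'a': 15.0}
```

Going bronze to silver here is "just a few transformations, just to lightly dust it"; the gold step is where the heavier, costlier logic accumulates.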
And so you want that right level of architecture and composability, if you will, of your systems, where you can deploy changes, you can test, you can have high degrees of confidence in the roll-outs. And I think the interesting part on the data side, as Jason highlighted, is that the big O notation for tech debt in the data ecosystem is still fairly exponential or polynomial in nature. Right now, we don't have great decomposition of the components. We have different systems. We have a streaming system, we have databases, we have document stores, and so on, but how the whole data pipeline, data engineering part works generally tends to be pretty monolithic in nature. You take your whole data pipeline and you deploy the whole thing and you basically just cross your fingers, and hopefully it's not 15 hours, but if it is 15 hours, you go to sleep, you wake up the next morning, grab a coffee, and then maybe it worked. And that iteration cycle is really slow. And so when we think about how we can improve these things, right? This is combinations of intelligent systems that do instantaneous schema detection and validation. It's things like automated lineage and dependency tracking, so you know, when you deploy code, what piece of data it affects. It's things like automated testing on individual core parts of your data pipelines to validate that you're getting the expected output that you need. So it's pulling a lot of these same DevOps style principles into the data world, which really goes back to: how do you help more people build more things faster and safely? Really rapid iterations for rapid feedback, so you know if there are breaks in the system much earlier on. >> Well, I think Sean, you're onto something really big there.
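One concrete form the "instantaneous schema detection and validation" Sean names could take is a pre-deploy guard like this hypothetical check (the field names and types are invented), which fails fast instead of 15 hours into a pipeline run:

```python
# Hypothetical pre-deploy schema guard: catch a break before the run,
# not 15 hours into it. Field names and expected types are invented.
EXPECTED = {"user_id": int, "income": float}

def validate(record, expected=EXPECTED):
    """Return a list of schema violations for one record."""
    errors = []
    for field, ftype in expected.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

good = {"user_id": 1, "income": 1200.0}
bad = {"user_id": "1"}  # wrong type, and income is missing

print(validate(good))  # []
print(validate(bad))   # ['bad type for user_id: str', 'missing field: income']
```

Running a check like this on a sample of records at deploy time is one small instance of the rapid-feedback loop Sean is describing: the break surfaces in seconds, not the next morning.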
And I think this is something that's emerging pretty quickly at cloud scale, what I call 2.0, or whatever version we're in: it's the systems thinking mindset. 'Cause you mentioned the model that was essentially a silo or subsystem. It was cohesive in its own way, but it's been monolithic. Now you have a broken down set of decomposed data pieces that have to work together. So Jason, this is the big challenge that not a lot of people are really talking about; I think these guys are, and you're using them. What are you unifying? Because this is systems thinking, operating systems thinking. This is not like a database problem. It's a systems problem applied to data, where databases are just pieces of it. What's your thoughts? >> That's absolutely right. So Sean touched on composability of ETL, and thinking about reusable components, thinking about pieces that all fit together, because as you're building something as complex as some of these ETLs are, we do think about the platform itself and how that lends to the overarching output. So one thing is being able to actually see the different components of an ETL and blend those in, using the DRY principle: don't repeat yourself. So you essentially are able to take pieces that one person built. Maybe John builds a couple of our connectors coming in, Sean also has a bunch of transforms, and I just want this stuff out, so I can use a lot of what you guys have already built. I think that's key, because a lot of engineering and data engineering is about managing complexity. So taking that complexity and essentially getting it out fast and getting it out error free is where we're going with all of the data products we're building.
>> Absolutely, so I could start at a couple of places. So I don't know if you guys are on the three Vs, four Vs or five Vs, but we have all of those. And if you go to that four or five V model, there is the veracity piece, where you have to ask yourself: is it true? Is it accurate, and when? So change happens throughout the pipeline, change can come from webhooks, change can come from users. You have to make sure that you're managing that complexity, and what we're building, I mentioned that we are paying down a lot of tech debt, but we're also building new products. And one quite challenging ETL that we're building is something going from a document store to an analytical application. So in that document store, we talked about flexible schema. Basically, you don't really know exactly what you're going to get day to day, and you need to be able to manage that change through the whole process in a way that the ultimate business users find value. So, that's one of the key applications that we're using right now. And that's one that the team at Ascend and my team are working hand in hand on, going through a lot of those challenges. I also watch Slack just as Sean does, and it's a very active discussion board. So it is essentially like they're just partnering together. It's fabulous, but yeah-- >> And you're seeing kind of a value on this too, I mean, in terms of output, what are the business results? >> Yes, absolutely. So essentially, yes, this is the fifth V, value. And getting to that value, there were a few pieces. So there are some data products that we're building within that product, and they're data science and data analytics based products that essentially do things with the data that help the user. There's also the question of exactly the usage and those kinds of metrics that people in ops want to understand, as well as our growth team. So we have internal and external stakeholders for that.
>> Jason, this is a great use case, a great customer, Sean, you guys are automating. For the folks watching, who are seeing their peer living the dream here in the data journey, as we say, things are happening. What's the message to customers that you guys want to send? Because you guys are really cutting your teeth into a whole other level of data engineering, data platform. That's really about the systems view and about cloud. What's the pitch, Sean? What should people know about the company? >> Absolutely, yeah, well, so one, I'd say even before the pitch, I would encourage people to not accept the status quo. And in particular, in data engineering today, the status quo is an incredibly high degree of pain and discomfort. And I think the important part of why Ascend exists, and why we're so helpful for our customers, is that there is a much more automated future of how we build data products, how we optimize those, and how we can get a larger cohort of builders into the data ecosystem. And that helps us get out of the muck, as we talked about before, and put really advanced technology to work for more people inside of our companies to build these data products, leveraging the latest and greatest technologies to drive increased business value faster. >> Jason, what's your assessment of these guys, as people watching might say, hey, you know what, I'm going to contact them, I need this. How would you talk about Ascend to your peers? >> Absolutely, so I think just thinking about the whole process, it has been a great partnership. We started with a POC, I think Ascend likes to start with three use cases, I think we came out with four, and we went through the ones that we really cared about and really wanted to bring value to the company with. So we have roadmaps for some, as we're paying down technical debt and transitioning; others we can go directly to.
And I think that thinking about, just like you're saying, John, that systems view of everything you're building, where that makes sense, you can actually take a lot of that complexity and encapsulate it in a way that you can essentially manage it all in that platform. So the Ascend platform has the composability piece that we touched on. And not only can you compose it, but you can drill into it. And my team is super talented and is going to drill into it. So basically, they love to open up each of those data flows, each of the components therein, and have the control there with the combination of Spark SQL, PySpark, Scala, and so on. And I think that the variety of connections is also quite helpful. So thinking about the DRY principle from a systems perspective is extremely useful, because DRY, you often get that in a code review, right? "I think you can be a little bit more DRY here." >> Yeah. >> But you can really do that in the way that you're composing your systems as well. >> That's a great, great point. One quick thing for the folks that are watching, that are trying to figure this out, and a lot of architecture is going on. A lot of people are looking at different solutions. What things have you learned that you could give them as a tip, like to avoid maybe some scar tissue, or tips of the trade, where you can say, hey, this way, be careful. What are some of the learnings? Could you give a few pointers to folks out there, if they're kicking tires on the direction? What's the wrong direction? What does the right direction look like? >> Absolutely, I think, thinking it through, I don't know how much time we have; that feels like a few days' conversation as far as ways to go wrong. But absolutely, I think that thinking through exactly where you want to be is the key.
Otherwise it's kind of like when you're writing a ticket in Jira: if you don't have clear success criteria, if you don't know where you're going to go, then you'll end up somewhere, building something, and it might work. But if you think through your exact destination that you want to be at, that will drive a lot of the decisions as you think backwards to where you started. And also, Sean mentioned challenging the status quo. I think that you really have to be ready to challenge the status quo at every step of that journey. So if you start with some particular service that you had, and it's legacy, if it's not essentially performing what you need, then it's okay to just take a step back and say, well, maybe that's not the one. So I think that thinking through the system, just like you were saying, John, and also having a visual representation of where you want to go, is critical. So hopefully that encapsulates a lot of it, but yes, the destination is key. >> Yeah, and having an engineering platform that also unifies the multiple components and is agile. >> That's right. >> It gets you out of the muck, and at the end of the day, undifferentiated heavy lifting is a cloud play. >> Absolutely. >> Sean, wrap it up for us here. What's the bumper sticker for your vision? Share your founding principles of the company. >> Absolutely. For us, I started the company as a recovering CTO. The last company I founded, we had nearly 60 people on our data team alone and had invested tremendous amounts of effort over the course of eight years. And one of the things that I've learned is that over time, innovation comes just as much from deciding what you're no longer going to do as what you're going to do. And focusing heavily around how do you get out of that muck, how do you continue to climb up that technology stack, is incredibly important.
And so really we are excited to be a part of it, as the industry continues to climb to higher and higher levels. We're building more and more advanced levels of automation, and what we call our data awareness, into the automated engine of the Ascend platform, which takes us across the entire data ecosystem, connecting and automating all data movement. And so we have a very exciting vision for this fabric that's emerging over time. >> Awesome, Sean, thank you so much for that insight. Jason, thanks for coming on as a customer of Ascend.io. >> Thank you. >> I appreciate it, gentlemen, thank you. This is the track on automating analytics workloads. We're here at the AWS Startup Showcase with the hottest companies, here with Ascend.io. I'm John Furrier, with theCUBE, thanks for watching. (upbeat music)
Dinesh Nirmal, IBM | CUBEConversation
(upbeat music) >> Hi everyone. We have a special program today. We are joined by Dinesh Nirmal, who is VP of Development for Analytics at IBM, and Dinesh has an extremely broad perspective on what's going on in this part of the industry, and IBM has a very broad portfolio. So, between the two of us, I think we can cover a lot of ground today. So, Dinesh, welcome. >> Oh thank you George. Great to be here. >> So just to frame the discussion, I wanted to hit on sort of four key highlights. One is balancing the compatibility across cloud, on-prem, and edge versus leveraging specialized services that might be on any one of those platforms. And then harmonizing and simplifying both the management and the development of services across these platforms. You have that trade-off between: do I do everything compatibly, or can I take advantage of platform-specific stuff? And then, we've heard a huge amount of noise on Machine Learning. And everyone says they're democratizing it. We want to hear your perspective on how you think that's most effectively done. And then, if we have time, how to manage Machine Learning data feedback loops to improve the models. So, let's start with that. >> So you talked about the private cloud and the public cloud, and then, how do you manage the data and the models, or the other analytical assets, across the hybrid nature of today. So, if you look at our enterprises, it's a hybrid format that most customers adopt. I mean, you have some data on the public side, but you have your mission critical data, that's very core to your transactions, exists in the private cloud. Now, how do you make sure that the data that you've pushed on the cloud, that you can go use to build models? And then you can take that model and deploy it on-prem or on public cloud.
>> Is that the emerging sort of mainstream design pattern, where mission critical systems are less likely to move, for latency, for the fact that they're fused to their own hardware, but you take the data, and the researching for the models happens up in the cloud, and then that gets pushed down close to where the transaction decisions are. >> Right, so there's also the economics of data that comes into play, so if you are doing a, you know, large scale neural net, where you have GPUs, and you want to do deep learning, obviously, you know, it might make more sense for you to push it into the cloud and be able to do that or one of the other deep learning frameworks out there. But then you have your core transactional data that includes your customer data, you know, or your customer medical data, which I think some customers might be reluctant to push on a public cloud, and then, but you still want to build models and predict and all those things. So I think it's a hybrid nature, depending on the sensitivities of the data. Customers might decide to put it on cloud versus private cloud which is in their premises, right? So then how do you serve those customer needs, making sure that you can build a model on the cloud and that you can deploy that model on private cloud or vice versa. I mean, you can build that model on private cloud or only on private, and then deployed on your public cloud. Now the challenge, one last statement, is that people think, well, once I build a model, and I deploy it on public cloud, then it's easy, because it's just an API call at that time, just to call that model to execute the transactions. But that's not the case. You take support vector machine, for example, right, that still has vectors in there, that means your data is there, right, so even though you're saying you're deploying the model, you still have sensitive data there, so those are the kind of things customers need to think about before they go deploy those models. 
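The support vector machine caveat Dinesh raises is easy to demonstrate concretely. A minimal sketch (using scikit-learn as an assumed stand-in; the transcript doesn't name a library): a fitted `SVC` exposes `support_vectors_`, rows copied verbatim from the training set, so shipping the model object also ships those records.

```python
# Sketch: a trained SVM retains raw training rows as support vectors,
# so deploying the model object also deploys that data.
import numpy as np
from sklearn.svm import SVC

# Toy "sensitive" training data: [age, income] -> loan approved?
X = np.array([[25, 40], [32, 60], [47, 90],
              [51, 120], [29, 45], [44, 100]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1])

model = SVC(kernel="linear").fit(X, y)

# Every support vector is an exact row of the training set.
for sv in model.support_vectors_:
    assert any(np.array_equal(sv, row) for row in X)
print(f"{len(model.support_vectors_)} training rows live inside the deployed model")
```

This is why, as Dinesh notes, a linear regression (which keeps only coefficients) raises fewer data-residency concerns than a deployed SVM.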
>> So I might, this is a topic for our Friday interview with a member of the Watson IT family, but it's not so black and white when you say we'll leave all your customer data with you, and we'll work on the models, because it's, sort of like, teabags, you know, you can take the customer's teabag and squeeze some of the tea out, in your IBM or public cloud, and give them back the teabag, but you're getting some of the benefit of this data. >> Right, so like, it depends, depends on the algorithms you build. You could take a linear regression, and you don't have the challenges I mentioned in support vector machines, because none of the data is moving, it's just the model. So it depends, I think that's where, you know, what Watson has done will help tremendously, because the data is secure in that sense. But if you're building on your own, it's a different challenge, you've got to make sure you pick the right algorithms to do that. 
The benefit there is you get the whole stack in one, you have a whole pipeline that you can execute, you have one service provider that's giving them services, it's managed. So all those benefits come with it, and that's probably the preferred way: everything integrated all together in one stack. I think that's the path most people go towards, because then you have the whole pipeline available to you, and also the services that come with it. So any updates come with it, and if you take the first route, one challenge you have is how do you make sure all these services are compatible with each other? How do you make sure they're compliant? So if you're an insurance company, you want it to be HIPAA compliant. Are you going to individually make sure that each of these services is HIPAA compliant? Or would you get it from one integrated provider, where you can make sure they are HIPAA compliant, tests are done. So all those benefits, to me, outweigh going and putting unmanaged services all together, and then creating a data lake to underlay all of it. >> Would it be fair to say, to use an analogy, where Hadoop, being sort of, originating in many different Apache products, is a quasi-multi vendor kind of pipeline, and the state of the machine learning analytic pipeline is still kind of multi-vendor today. You see that moving toward a single vendor pipeline, who do you see as the sort of last man standing? >> So, I mean, I can speak from an IBM perspective, I can say that the benefit that a company, a vendor like IBM brings forward, is like, so the different, public or private cloud or hybrid, you obviously have the choice of going to public cloud, you can get the same service on public cloud, so you get a hybrid experience, so that's one aspect of it. 
Then, if you get the integrated solution, all the way from ingest to visualization, you have one provider, it's tested, it's integrated, you know, it's combined, it works well together, so I would say, going forward, if you look at it purely from an enterprise perspective, I would say integrated solutions is the way to go, because that is what will be the last man standing. I'll give you an example. I was with a major bank in Europe, about a month ago, and I took them through our data science experience, our machine learning project and all that, and you know, the CTO's take was that, Dinesh, I got it. Building the model itself, it only took us two days, but incorporating our model into our existing infrastructure, it has been 11 months, we haven't been able to do it. So that's the challenge our enterprises face, and they want an integrated solution to bring that model into their existing infrastructure. So that's, you know, that's my thought. >> Today though, let's talk about the IBM pipeline. Spark is core, Ingest is, off the-- >> Dinesh: Right, so you can do Spark streaming, you can use Kafka, or you can use infostream, which is our proprietary tool. >> Right, although, you wouldn't really use structured streaming for ingest, 'cause of the back pressure? >> Right, so they are-- >> The point that I'm trying to make is, it's still multi-vendor, and then on the serving side, once the analysis is done and predictions are made, some sort of SQL database has to take over, so today, it's still pretty multi-vendor. So how do you see any of those products broadening their footprints so that the number of pieces decreases. 
>> So good question, they are all going to get into the end-to-end pipeline, because that's where the value is. Unless you provide an integrated end to end solution for a customer, especially an enterprise customer, it's all about putting it all together, and putting these pieces together is not easy. Even if you ingest the data, IoT kind of data, a lot of times, 99% of the time, data is not clean; unless you're in a competition where you get cleansed data, in the real world, that never happens. So then, I would say 80% of a data scientist's time is spent on cleaning the data, shaping the data, preparing the data to build that pipeline. So for most customers, it's critical that they get that end to end, well oiled, well connected, integrated solution, rather than taking isolated solutions from each vendor. To answer your question, yes, every vendor is going to move into the ingest, the data cleansing phase, transformation, building the pipeline, and then visualization; if you look at those five steps, all of them have to be developed. >> But just building the data cleansing and transformation, having it native to your own pipeline, that doesn't sound like it's going to solve the problem of messy data that needs, you know, human supervision to correct. >> I mean, so there is some level of human supervision to be sure. So I'll give you an example, right, so when data comes in from an insurance company, a lot of times, the gender could be missing. How do you know if it's a male or female? Then you've got to build another model to say, you know, this patient has gone for a prostate exam, it's a male; gynecology, it's a female. So you have to do some inference work in there, to make sure that the data is clean, and then there's some human supervision to make sure that this is good to build models, because when you're executing that pipeline in real time, >> Yeah. 
>> It's all based on the past data, so you want to make sure that the data is as clean as possible to train the model that you're going to execute on. >> So, let me ask you, turning to a slide we've got about complexity, and first, for developers, and then second, for admins. If we take the steps in the pipeline as ingest, process, analyze, predict, serve, and sort of products or product categories as Kafka, Spark streaming and SQL, a web service for predict, and MPP SQL or NoSQL for serve, even if they all came from IBM, would it be possible to unify the data model, the addressing and name space, and I'm just kicking off a few that I can think of, programming model, persistence, transaction model, workflow, testing integration? There's one thing to say it's all IBM, and then there's another thing, so that the developer working with it sees it as one suite. >> So it has to be validated, and that's the benefit that IBM brings already, because we obviously test each segment to make sure it works. But when you talk about complexity, building the model is one, you know, development of the model, but now the complexity also comes in the deployment of the model, now we talk about the management of the model: how do you monitor it? When was the model deployed, was it deployed in test, was it deployed in production, and who changed that model last, what was changed, and how is it scoring? Is it scoring high or low? You want to get a notification when the model starts scoring low. So complexity is all the way across, all the way from getting the data in, cleaning the data, developing the model; it never ends. And the other benefit that IBM has added is the feedback loop, which reduces that complexity: today, if the model scores low, you have to take it offline, retrain the model based on the new data, and then redeploy it. 
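The gender-imputation example a moment ago can be sketched as a simple rule-based pass over records before training. The procedure-to-gender rules and field names here are illustrative assumptions, not an IBM implementation:

```python
# Sketch: rule-based imputation of a missing field before model training.
# The procedure-to-gender rules below are illustrative assumptions.
GENDER_BY_PROCEDURE = {
    "prostate exam": "M",
    "gynecology visit": "F",
}

def impute_gender(record):
    """Fill a missing gender from procedure history when a rule applies."""
    if record.get("gender"):
        return record  # already present, leave it alone
    for proc in record.get("procedures", []):
        inferred = GENDER_BY_PROCEDURE.get(proc.lower())
        if inferred:
            # Flag the value as imputed so humans can review it later.
            return {**record, "gender": inferred, "gender_imputed": True}
    return record  # no rule fired: leave the gap for human supervision

patient = {"id": 7, "gender": None, "procedures": ["Prostate Exam"]}
print(impute_gender(patient))  # gender filled in as "M", flagged as imputed
```

The `gender_imputed` flag mirrors the human-supervision point: automated cleansing fills what it can, and everything it touches stays auditable.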
Usually for enterprises, there are slots where you can take it offline, put it back online, all these things, so it's a process. What we have done is created a feedback loop where we are training the model in real time, using real time data, so the model is continuously-- >> Online learning. >> Online learning. >> And challenger, champion, or AB testing to see which one is more robust. >> Right, so you can do that, I mean, you could have multiple models where you can do AB testing, but in this case, you can continuously train the model to say, okay, this model scores the best. And then, another benefit is that, if you look at the whole machine learning process, there's the data, there's development, there's deployment. On the development side, more and more it's getting commoditized, meaning picking the right algorithm; there's a lot of tools, including IBM's, that can suggest the right one to use for this, so that piece is getting a little, less complex, I don't want to say easier, but less complex. But the data cleansing and the deployment, those are the hard parts for enterprises: when you have thousands of models, how do you make sure that you deploy the right model? 
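The real-time feedback loop described above, retraining continuously rather than taking the model offline, is essentially online learning. A minimal sketch with a hand-rolled logistic model updated one observation at a time (a stand-in for something like scikit-learn's `partial_fit`; not IBM's actual mechanism):

```python
import math

class OnlineLogisticModel:
    """Tiny logistic regression updated incrementally as feedback arrives."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on a single labeled observation (the feedback loop)."""
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

model = OnlineLogisticModel(n_features=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200  # simulated live feedback
for x, y in stream:
    model.update(x, y)
print(round(model.predict_proba([1.0, 0.0]), 2))  # close to 1 after the stream
```

The same `update` path serves both the champion and any challenger models, so scores can be compared continuously instead of in offline retraining windows.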
>> Exactly, so the data pipeline could be changing. So if you take a loan example, right, a lot of the data that goes in the model pipeline is static. I mean, my age, it's not going to change every day, I mean, it is, but you know, the age that goes into my salary, my race, my gender, those are static data that you can take and put in there. But then there's also real time data that's coming: my loan amount, my credit score, all those things. So how do you bring that data pipeline, between real time and static data, into the model pipeline, so the model can predict accurately, and based on the score dipping, you should be able to retrain the model using real time data. >> I want to take you, Dinesh, to the issue of a multi-vendor stack again, and the administrative challenges. So here, we look at a slide that shows me just rattling off some of the admin challenges: governance, performance modeling, scheduling, orchestration, availability, recovery, authentication, authorization, resource isolation, elasticity, testing integration, so that's the Y-axis, and then for every different product in the pipeline as the X-axis, say Kafka, Spark structured streaming, MPP SQL, NoSQL, so you got a mess. >> Right. >> Most open source companies are trying to make life easier for companies by managing their software as a service for the customer, and that's typically how they monetize. But tell us what you see the problem is, or will be, with that approach. >> So, great question. Let me take a very simple example. Probably most of our audience know about GDPR, which is the European law on the right to forget. So if you're an enterprise, and I say, George, I want my data deleted, you have to delete all of my data within a period of time. Now, that's where one of the aspects you talked about with governance comes in. How do you make sure you have governance across not just data but your individual assets? 
So if you're using a multi-vendor solution, with all of that data governance, how do I make sure that data gets deleted by all these services that are tied together? >> Let me maybe make an analogy. On CSI, when they pick up something at the crime scene, they've got to make sure that it's bagged, and the chain of custody doesn't lose its integrity all the way back to the evidence room. I assume you're talking about something like that. >> Yeah, something similar. Where the data, as it moves between private cloud, public cloud, and the analytical assets using that data, all those things need to work seamlessly for you to execute that particular transaction to delete data from everywhere. >> So it's not just administrative costs, but regulations that are pushing towards more homogeneous platforms. >> Right, right, and even if you take some of the other things on the stack, monitoring, logging, metering, each provides some of those capabilities, but you have to make sure, when you put all these services together, how are they going to integrate all together? You have one monitoring stack, so if you're pulling, you know, your IoT kind of data into a data center, for your whole stack evaluation, how do you make sure you're getting the right monitoring data across the board? Those are the kind of challenges that you will have. >> It's funny you mention that, because we were talking to an old Lotus colleague of mine, who was CTO of Microsoft's IT organization, and we were talking about how the cloud vendors can put a machine learning management application across their properties, or their services, but he said one of the first problems he'll encounter is the telemetry. It's really easy on hardware: CPU utilization, memory utilization, I/O. But as you get higher up in the application services, it becomes much more difficult to harmonize, so that a program can figure out what's going wrong. 
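Returning to the right-to-forget example above: one deletion request has to fan out across every service holding the member's data. A minimal sketch of that coordination (the service names and stores here are hypothetical):

```python
# Sketch: GDPR "right to forget" fan-out across services (names hypothetical).
class DeletionCoordinator:
    def __init__(self):
        self.handlers = {}  # service name -> delete function

    def register(self, service, handler):
        self.handlers[service] = handler

    def forget(self, user_id):
        """Invoke every registered service's delete; report what happened."""
        results = {}
        for service, handler in self.handlers.items():
            try:
                handler(user_id)
                results[service] = "deleted"
            except Exception as exc:
                results[service] = f"failed: {exc}"  # must be retried/audited
        return results

# Two hypothetical stores that both hold data for user 42.
warehouse = {42: "order history"}
feature_store = {42: [0.1, 0.9]}

coord = DeletionCoordinator()
coord.register("warehouse", lambda uid: warehouse.pop(uid, None))
coord.register("feature_store", lambda uid: feature_store.pop(uid, None))
print(coord.forget(42))  # both stores report "deleted"
```

In a multi-vendor stack, each handler would be a different vendor's API, which is exactly where the chain-of-custody auditing Dinesh describes gets hard.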
>> Right, and I mean, like anomaly detection, right? >> Yes. >> I mean, how do you make sure you're seeing patterns where you can predict something before it happens, right? >> Is that on the road map for...? >> Yeah, so we're already working with some big customers to say, if you have a data center, how do you look at outages to predict what can go wrong in the future, root cause analysis, I mean, that is a huge problem to solve. So let's say a customer hit a problem, you took an outage, what caused it? Because today, you have specialists who will come and try to figure out what the problem is, but can we use machine learning or deep learning to figure out, is it a fix that was missing, or an application that got changed that caused a CPU spike, that caused the outage? So that whole root cause analysis is the hardest one to solve, because you are talking about decades' worth of people's knowledge, and now you are teaching a machine to do that prediction. >> And from my understanding, root cause analysis is most effective when you have a rich model of how your, in this case, data structures and apps are working, and there might be many little models, but they're held together by some sort of knowledge graph that says here is where all the pieces fit, these are the pieces below these, sort of as peers to these other things. How does that knowledge graph get built in, and is this the next generation of a configuration management database? >> Right, so I call it the self-healing, self-managing, self-fixing data center. 
It's easy for you to turn up the heat or A/C and the temperature goes down, I mean, those are good, but the real value for a customer is exactly what you mentioned: building up that knowledge graph from different models that all come together. But the hardest part is, predicting an anomaly is one thing, but getting to the root cause is a different thing, because at that point, now you're saying, I know exactly what caused this problem, and I can prevent it from happening again. That's not easy. We are working with our customers to figure out how do we get to the root cause analysis, but it's all about building the knowledge graph with multiple models coming from different systems; today, I mean, enterprises have different systems from multiple vendors. We have to bring all that monitoring data into one source, and that's where that knowledge comes in, and then different models will feed that data, and then you need to mine that data, using deep learning algorithms, to say, what caused this? >> Okay, so this actually sounds extremely relevant, although we're probably, in the interest of time, going to have to dig down on that one another time. But, just at a high level, it sounds like the knowledge graph is sort of your web or directory into how local components or local models work, and then, knowing that, if it sees problems coming up here, it can understand how it affects something else tangentially. >> So think of the knowledge graph as a neural net, because it's just building a new neural net based on the past data, and it has that built-in knowledge where it says, okay, these symptoms seem to be a problem that I have encountered in the past. Now I can predict the root cause because I know this happened in the past. So it's kind of like you're putting that net in place to build new problem determinations as it goes along. So it's a complex task. It's not easy to get to root cause analysis. But that's something we are aggressively working on developing. 
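The knowledge-graph idea just discussed, walking from an observed symptom back through known causal links to a probable root cause, can be sketched as a reverse traversal over an edge list. The causal edges below are invented examples, not data from any real incident system:

```python
from collections import deque

# Sketch: symptom -> possible-cause edges learned from past incidents
# (the edges here are invented examples).
CAUSED_BY = {
    "outage": ["cpu_spike", "disk_full"],
    "cpu_spike": ["app_change", "missing_fix"],
    "disk_full": ["log_rotation_disabled"],
}

def root_causes(symptom):
    """Walk causal edges breadth-first; nodes with no known cause are roots."""
    roots, queue, seen = [], deque([symptom]), {symptom}
    while queue:
        node = queue.popleft()
        causes = CAUSED_BY.get(node, [])
        if not causes and node != symptom:
            roots.append(node)  # nothing is known to cause it: a root cause
        for c in causes:
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return roots

print(root_causes("outage"))  # → ['app_change', 'missing_fix', 'log_rotation_disabled']
```

Real systems would weight these edges with learned probabilities rather than treating them as certain, which is the hard part Dinesh points to.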
>> Okay, so let me ask, let's talk about sort of democratizing machine learning and the different ways of doing that. You've actually talked about the big pain points, maybe not so sexy, but that are critical, which is operationalizing the models, and preparing the data. Let me bounce off you some of the other approaches. One that we have heard from Amazon is that they're saying, well, data munging might be an issue, and operationalizing the models might be an issue, but the biggest issue in terms of making this developer ready is: we're going to take the machine learning we use to run our business, whether it's merchandising fashion, running recommendation engines, managing fulfillment or logistics, and, just like they did with AWS, they're dog-fooding it internally, and then they're going to put it out on AWS as a new layer of the platform. Where do you see that being effective, and where less effective? >> Right, so let me answer the first part of your question, the democratization of machine learning. That happens when, for example, a real estate agent who has no idea about machine learning can come and predict the house prices in this area. That, to me, is democratizing. Because at that time, you have made it available to everyone, everyone can use it. But that comes back to our first point, which is having that clean set of data. You can build all the pre-canned pipelines out there, but if you're not feeding a clean set of data into them, none of this works, you know. Garbage in, garbage out, that's what you're going to get. So when we talk about democratization, it's not that easy and simple, because you can build all these pre-canned pipelines that you have used in-house for your own purposes, but every customer has many unique cases. So if I take you as a bank, your fraud detection methods are completely different than mine as a bank; my limit for fraud detection could be completely different. 
So there is always customization that's involved, the data that's coming in is different, so while it's a buzzword, I think there's knowledge that people need to feed it, there's models that need to be tuned and trained, and there's deployment that is completely different, so you know, there is work that has to be done. >> So then what I'm taking away from what you're saying is, you don't have to start from ground zero with your data, but you might want to add some of your data, which is specialized, or slightly different from what the pre-trained model used. You still have to worry about operationalizing it, so it's not a pure developer-ready API, but it uplevels the skills requirement so that it's not quite as demanding as working with TensorFlow or something like that. >> Right, I mean, so you can always build pre-canned pipelines and make them available, so we have already done that. For example, fraud detection, we have pre-canned pipelines; for IT analytics, we have pre-canned pipelines. So it's nothing new, you can always take what you have done in house and make it available to the public or the customers, but then they have to take it and do customization to meet their demands, bring their data to retrain the model, all those things have to be done. It's not just about providing the model, because every customer use case is completely different; whether you are looking at fraud detection from one bank's perspective, not all banks are going to do the same thing. Same thing for predicting, for example, the loan, I mean, your loan approval process is going to be completely different than mine as a bank's loan approval process. 
Like you're trying to fix the two sides of the pipeline, I guess, thinking of it like a barbell, you know, where are the others based on their data assets and their tools, where do they need to work. >> So, there's two aspects of it. There's the enterprise aspect, so as an enterprise, I would like to say, it's not just about the technology, but there's also the services aspect. If my model goes down in the middle of the night, and my banking app is down, who do I call? If I'm using a service that is available on the cloud provider which is open source, do I have the right amount of coverage to call somebody and fix it? So there's the enterprise capabilities, availability, reliability; that is different than a developer who comes in with a CSV file that he or she wants to build a model on to predict something. That's different, this is different, two different aspects. So if you talk about, you know, all these vendors, if I'm wearing an enterprise hat, some of the things I would look at is, can I get an integrated solution, end to end, on the machine learning platform. 
I think the benefit IBM brings in is, one, you have the choice or the freedom to bring in cloud or on-prem or hybrid, you have all the choices of languages, like we support R, Python, Spark, I mean, SPSS, so I mean, the choice, the freedom, the reliability, the availability, the enterprise nature, that's where IBM comes in and differentiates, and that's, for our customers, a huge plus. >> One last question, and we're really out of time, in terms of thinking about a unified pipeline. When we were at Spark Summit, sitting down with Matei Zaharia and Reynold Xin, the question came up that Databricks has an incomplete pipeline: no persistence, no ingest, not really much in the way of serving, but boy are they good at, you know, data transformation, and munging and machine learning. But they said they consider it part of their ultimate responsibility to take control. And on the ingest side it's Kafka, the serving side might be Redis or something else, or the Spark databases like SnappyData and Splice Machine. Spark is so central to IBM's efforts. What might a unified Spark pipeline look like? Have you guys thought about that? >> It's not there; obviously they probably could be working on it, but for our purposes, Spark is critical for us, and the reason we invested in Spark so much is because of the execution engine, where you can take a tremendous amount of data and, you know, crunch through it in a very short amount of time. That's the reason. We also invested in Spark SQL, because we have a good chunk of customers who still use SQL heavily. We put a lot of work into Spark ML, so we are continuing to invest, and probably they will get to an integrated solution, but it's not there yet, but as it comes along, we'll adapt. 
If it meets our needs and demands, and enterprise can do it, then definitely, I mean, you know, we saw that Spark's core engine has the ability to crunch a tremendous amount of data, so we are using it, I mean, 45 of our internal products use Spark as our core engine. Our DSX, Data Science Experience, has Spark as our core engine. So, yeah, I mean, today it's not there, but I know they're probably working on it, and if there are elements of this whole pipeline that comes together, that is convenient for us to use, and at enterprise level, we will definitely consider using it. >> Okay, on that note, Dinesh, thanks for joining us, and taking time out of your busy schedule. My name is George Gilbert, I'm with Dinesh Nirmal from IBM, VP of Analytics Development, and we are at the Cube studio in Palo Alto, and we will be back in the not too distant future, with more interesting interviews with some of the gurus at IBM. (peppy music)
Bruno Aziza & Josh Klahr, AtScale - Big Data SV 17 - #BigDataSV - #theCUBE
>> Announcer: Live from San Jose, California, it's The Cube. Covering Big Data, Silicon Valley, 2017. (electronic music) >> Okay, welcome back everyone, live at Silicon Valley for the big The Cube coverage, I'm John Furrier, with me Wikibon analyst George Gilbert, Bruno Aziza, who's the CMO of AtScale, Cube alumni, and Josh Klahr, VP at AtScale, welcome to the Cube. >> Welcome back. >> Thank you. >> Thanks, Brian. >> Bruno, great to see you. You look great, you're smiling as always. Business is good? >> Business is great. >> Give us the update on AtScale, what's up since we last saw you in New York? >> Well, thanks for having us, first of all. And, yeah, business is great, we- I think last time I was here on The Cube we talked about the Hadoop Maturity Survey, and at the time we'd just launched the company. And so now you look about a year out and we've grown about 10x. We have large enterprises across just about any vertical you can think of. You know, financial services, your American Express, healthcare, think about Aetna, Cigna, GSK, retail, Home Depot, Macy's and so forth. And we've also done a lot of work with our partner ecosystem, so partners OEM AtScale technology, which is a great way for us to get AtScale across the US, but also internationally. And then our customers are getting recognized for the work that they are doing with AtScale. So, last year, for instance, Yellowpages got recognized by Cloudera with their leadership award. And Macy's got a leadership award as well. So, things are going in the right trajectory, and I think we're also benefitting from the fact that the industry is changing, it's maturing on the big data side, but also there's a redefinition of what business intelligence means. This idea that you can have analytics on large-scale data without having to change your visualization tools, and make that work with the existing stack you have in place. And, I think that's been helping us in growing- >> How did you guys do it?
I mean, you know, we've talked many times and there's some secret sauce there, but, at the time when you guys were first starting it was a kind of crowded field, right? >> Bruno: Yeah. >> And all these BI tools were out there, you had front end BI tools- >> Bruno: Yep. >> But everyone was still separate from the whole batch back end. So, what did you guys do to break out? >> So, there's two key differentiators with AtScale. The first one is we are the only platform that does not have a visualization tool. And so people think about this as, that's a bug; that's actually a feature. Because most enterprises already have that stuff built with traditional BI tools. And so our ability to talk MDX and SQL to those types of BI tools, without any changes, is a big differentiator. And then the other piece of our technology is this idea that you can get the speed, the scale and security on large data sets without having to move the data. It's a big differentiator for enterprises to get value out of the data they already have in Hadoop as well as non-Hadoop systems, which we cover. >> Josh, you're the VP of products, you have the roadmaps, give us a peek into what's happening with the current product. And where's the work areas? Where are you guys going? What's the to-do list, what's the check box, and what's the innovation coming around the corner? >> Yeah, I think, to follow up on what Bruno said about how we hit the sweet spot, I think we made a strategic choice, which is we don't want to be in the business of trying to be Tableau or Excel or be a better front end. And there's so much diversity on the back end; if you look at the ecosystem right now, whether it's Spark SQL, or Hive, or Presto, or even new cloud based systems, the sweet spot is really how do you fit into those ecosystems and support the right level of BI on top of those applications.
So, what we're looking at from a road map perspective is how do we expand and support the back end data platforms that customers are asking about? I think we saw a big white space in BI on Hadoop in particular. And that's- I'd say we've nailed it over the past year and a half. But we see customers now that are asking us about Google BigQuery. They're asking us about Athena. I think these serverless data platforms are really, really compelling. They're going to take a while to get adoption. So, that's a big investment area for us. And then, in terms of supporting BI front ends, we're kind of doubling down on making sure our Tableau integration is great; Power BI is, I think, getting really big traction. >> Well, two great products, you've got Microsoft and Tableau, leaders in that area. >> The self-service BI revolution, I would say, has won. And the business user wants their tool of choice. Where we come in is with the folks responsible for data platforms on the back end; they want some level of control and consistency, and so they're trying to figure out, where do you draw the line? Where do you provide standards? Where do you provide governance, and where do you let the business loose? >> All right, so, Bruno and Josh, I want you to answer the questions, it'll be a good quiz. So, define next generation BI platforms from a functional standpoint and then under the hood. >> Yeah, there's a few things you can look at. I think if you were at the Gartner BI conference last week you saw that there were 24 vendors in the magic quadrant, and I think in general people are now realizing that this is a space that is extremely crowded, and it's also sitting on technology that was built 20 years ago. Now, when you talk to enterprises like the ones we work with, like the ones I named earlier, you realize that they all have multiple BI tools. So, the visualization war, if you will, kind of has been settled and almost won by Microsoft and Tableau at this point.
And the average enterprise has 15 different BI tools. So, clearly, if you're trying to innovate on the visualization side, I would say you're going to have a very hard time. So, you're dealing with that level of complexity. And then, at the back end standpoint, you're now having to deal with databases from the past - the Teradatas of this world - data sources from today - Hadoop - and data sources from the future, like Google BigQuery. And so I think the CIO answer to "what is the next gen BI platform I want" is something that enables me to simplify this very complex world. I have lots of BI tools, lots of data; how can I standardize in the middle in order to provide security, provide scale, provide speed to my business users? And, you know, that's really radically going to change the space, I think. If you're trying to sell a full stack that's integrated from the bottom all the way to visualization, I don't think that's what enterprises want anymore. >> Josh, under the hood, what's the next generation- you know, key leverage for the tech, and, just the enabler. >> Yeah, so, for me the end state for the next generation BI platform is a user can log in, they can point to their data wherever that data is - it's on-prem, it's in the cloud, it's in a relational database, it's a flat file - and they can design their business model. We spend a lot of time making sure we can support the creation of business models: what are the key metrics, what are the hierarchies, what are the measures? It may sound like I'm talking about OLAP. You know, that's what our history is steeped in. >> Well, faster data is coming, that's- streaming and data is coming together. >> So, I should be able to just point at those data sets and turn around and be able to analyze it immediately. On the back end that means we need to have pretty robust modeling capabilities.
So that you can define those complex metrics, so you can functionally do what are traditional business analytics: period over period comparisons, rolling averages, navigating up and down business hierarchies. The optimizations should be built in. It shouldn't be the responsibility of the designer to figure out, do I need to create indices, do I need to create aggregates, do I need to create summarization? That should all be handled for you automatically. You shouldn't think about data movement. And so that's really what we've built in from an AtScale perspective on the back end. Point to data, we're smart about creating optimal data structures so you get fast performance. And then you should be able to connect whatever BI tool you want. You should be able to connect Excel; we can talk the MDX query language. We can talk SQL, we can talk DAX, whatever language you want to talk. >> So, take the syntax out of the hands of the user. >> Yeah. >> Yeah. >> And getting in the weeds on that stuff. Make it easier for them- >> Exactly. >> And the key word, I think, for the future of BI is open, right? We've been buying tools over the last- >> What do you mean by that, explain. >> Open means that you can choose whatever BI tool you want, and you can choose whatever data you want. And, as a business user, there's no real compromise. But, because you're getting an open platform, it doesn't mean that you have to trade off complexity. I think some of the stuff that Josh was talking about - period analysis, the type of multidimensional analysis that you need, calendar analysis, historical data - that's still going to be needed, but you're going to need to provide this in a world where the business user and IT organization expect that the tools they buy are going to be open to the rest of the ecosystem, and that's new, I think. >> George, you want to get a question in, edgewise? Come on.
(group laughs) >> You know, I've been sort of a single-issue candidate, I guess, this week on machine learning and how it's sort of touching all the different sectors. And I'm wondering, are you- how do you see yourselves as part of a broader pipeline of different users adding different types of value to data? >> I think maybe on the machine learning topic there are a few different ways to look at it. The first is we do use machine learning in our own product. I talked about this concept of auto-optimization. One of the things that AtScale does is it looks at end-user query patterns. And we look at those query patterns and try to figure out how can we be smart about anticipating the next thing they're going to ask, so we can pre-index or pre-materialize that data. So, there's machine learning in the context of making AtScale a better product. >> Reusing things that are already done, that's been the whole machine-learning- >> Yes. >> Demos, we saw Google Next with the video editing and the video recognition stuff, that's been- >> Exactly. >> Huge part of it. >> You've got users giving you signals; take that information and be smart with it. I think, in terms of the customer work flow - Comcast, for example, a customer of ours - we are in a data discovery phase; there's a data science group that looks at all of their set top box data, and they're trying to discover programming patterns. Who uses the Yankees' network, for example? And where they use AtScale is what I would call a descriptive element, where they're trying to figure out what are the key measures and trends, and what are the attributes that contribute to that. And then they'll go in and they'll use machine learning tools on top of that same data set to come up with predictive algorithms. >> So, just to be clear there, they're hypothesizing about, like, say, either the pattern of users that might be- have an affinity for a certain channel or channels, or they're looking for pathways. >> Yes.
And I'd say our role in that right now is a descriptive role. We're supporting the descriptive element of that analytics life cycle. I think over time our customers are going to push us to build in more of our own capabilities, when it comes to, okay, I discovered something descriptive, can you come up with a model that helps me predict it the next time around? Honestly, right now people want BI. People want very traditional BI on the next generation data platform. >> Just continuing on that theme, leaving machine learning aside, I guess, as I understand it, when we talk about the old school vendors - Teradata - when they wanted to support data scientists, they grafted on some machine learning, like a parallel version of R in the core Teradata engine. They also bought Aster Data, which was, you know, for a different audience. So, I guess, my question is, will we see from you, ultimately, a separate product line to support a new class of users? Or are you thinking about new functionality that gets integrated into the core product? >> I think it's more of the latter. So, the way that we view it - and this is really looking at, like I said, what people are asking for today - is kind of the basic, traditional BI. What we're building is essentially a business model. So, when someone uses AtScale, they're designing and they're telling us, they're asserting: these are the things I'm interested in measuring, and these are the attributes that I think might contribute to it. And so that puts us in a pretty good position to start using, whether it's Spark on the back end, or built-in machine learning algorithms on the Hadoop cluster - let's start using our knowledge of that business model to help make predictions on behalf of the customer. >> So, just a follow-up, and this really leaves out the machine learning part, which is, it sounds like, in terms of big data, we first used it to archive - it supported more data retention than you could do affordably with the data warehouse.
Then we did the ETL offload; now we're doing more and more of the visualization, the ad-hoc stuff. >> That's exactly right. >> So, what- in a couple years' time, what remains in the classic data warehouse, and what's in the Hadoop category? >> Well, so there is- I think what you're describing is the pure evolution of, you know, any technology where you start with the infrastructure. You know, we've been in this for over ten years now; you've got Cloudera, they are going IPO and then going into the data science workbench. >> That's not official yet. >> I think we read about this, or at least they filed. But I think the direction is showing- now people are relying on the platform, the Hadoop platform, in order to build applications on top of it. And so, I think, just like Josh is saying, the mainstream application on top of the database - and I think this is true for non-Hadoop systems as well - is always going to be analytics. Of course, data science is something that provides a lot of value, but it typically provides a lot of value to a few set of people that will then scale it out to the rest of their organization. I think if you now project out to what does this mean for the CIO and their environment, I don't think any of these platforms - Teradata or Hadoop, or Google, or Amazon or any of those - I don't think they do a 100% replace. And I think that's where it becomes interesting, because you're now having to deal with a heterogeneous environment: the business user is up top, they're using Excel, they're using their standard net application, they might be using the results of machine learning models, but they're also having to deal with the heterogeneous environment at the data level. Hadoop on-prem, Hadoop in the cloud, non-Hadoop in the cloud and non-Hadoop on-prem. And, of course, that's a market that I think is very interesting for us as a simplification platform for that world.
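Earlier in the conversation, Josh described AtScale's auto-optimization: mine end-user query patterns to anticipate what to pre-index or pre-materialize. A minimal sketch of that idea in plain Python - purely illustrative, with invented field names, and not AtScale's actual algorithm:

```python
from collections import Counter

def suggest_aggregates(query_log, min_count=2):
    """Count (table, group-by columns) patterns in a query log and
    suggest pre-aggregations for patterns seen at least min_count times."""
    patterns = Counter(
        (q["table"], tuple(sorted(q["group_by"]))) for q in query_log
    )
    return [pattern for pattern, n in patterns.most_common() if n >= min_count]

# Hypothetical query log: two queries group viewership by network and week,
# so that combination is worth pre-aggregating.
log = [
    {"table": "viewership", "group_by": ["network", "week"]},
    {"table": "viewership", "group_by": ["week", "network"]},
    {"table": "viewership", "group_by": ["zip_code"]},
]
print(suggest_aggregates(log))  # [('viewership', ('network', 'week'))]
```

Sorting the group-by columns makes logically identical queries count as the same pattern regardless of column order; a real system would also weigh query cost and data size, not just frequency.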
>> I think you guys are really thinking about it in a new way, and I think that's kind of a great, modern approach - let the freedom- and by the way, quick question on the Microsoft tool and Tableau: what percentage share do you think they are of the market? 50? Because you mentioned those are the two top ones. >> Are they? >> Yeah, I mentioned them, because if you look at the magic quadrant, clearly Microsoft, Power BI and Tableau have really shot up all the way to the right. >> Because it's easy to use, and it's easy to work with data. >> I think so. I think- look, from a functionality standpoint, you see Tableau's done a very good job on the visualization side. I think, from a business standpoint and a business model execution - and I can talk from my days at Microsoft - it's a very great distribution model to get thousands and thousands of users to use Power BI. Now, the guys that we didn't talk about are on the last magic quadrant - people like Google Data Studio or Amazon QuickSight - and I think that will change the ecosystem as well. Which, again, is great news for AtScale. >> More muscle coming in. >> That's right. >> For you guys, just more rising tide floats all boats. >> That's right. >> So, you guys are powering it. >> That's right. >> Modern BI, would it be safe to say? >> That's the idea. The idea is that the visualization is basically commoditized at this point. And what business users want, and what enterprise leaders want, is the ability to provide freedom and openness to their business users and never have to compromise security, speed and also the complexity of those models, which is what we're in the business of. >> Get people working, get people productive faster. >> In whatever tool they want. >> All right, Bruno, thanks so much. Thanks for coming on. AtScale. Modern BI here in The Cube. Breaking it down. This is The Cube covering Big Data SV, Strata Hadoop. Back with more coverage after this short break. (electronic music)
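The business-model metrics Josh walked through - period-over-period comparisons and rolling averages - have simple semantics, which a short pure-Python sketch can show (illustrative only; AtScale pushes this kind of computation down to the underlying data platform rather than running it like this):

```python
def rolling_average(values, window):
    """Trailing rolling average; None until a full window is available."""
    return [
        None if i + 1 < window else sum(values[i + 1 - window : i + 1]) / window
        for i in range(len(values))
    ]

def period_over_period(values, lag=1):
    """Fractional change versus the value `lag` periods earlier."""
    return [
        None if i < lag else (v - values[i - lag]) / values[i - lag]
        for i, v in enumerate(values)
    ]

# Hypothetical monthly revenue series.
monthly_revenue = [100, 110, 121, 133]
print(rolling_average(monthly_revenue, 2))  # [None, 105.0, 115.5, 127.0]
print(period_over_period(monthly_revenue))  # month-over-month growth rates
```

The point of a semantic layer is that a business user asks for "rolling 3-month average" by name, and definitions like these are applied consistently no matter which BI front end issued the query.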
Holden Karau, IBM - #BigDataNYC 2016 - #theCUBE
>> Narrator: Live from New York, it's the CUBE from Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, Nvidia. And our ecosystem sponsors. Now, here are your hosts: Dave Vellante and Peter Burris. >> Welcome back to New York City, everybody. This is the CUBE, the worldwide leader in live tech coverage. Holden Karau is here, principal software engineer with IBM. Welcome to the CUBE. >> Thank you for having me. It's nice to be back. >> So, what's with Boo? >> So, Boo is my stuffed dog that I bring- >> You've got to hold Boo up. >> Okay, yeah. >> Can't see Boo. >> So, this is Boo. Boo comes with me to all of my conferences in case I get stressed out. And she also hangs out normally on the podium while I'm giving the talk as well, just in case people get bored. You know, they can look at Boo. >> So, Boo is not some new open source project. >> No, no, Boo is not an open source project. But Boo is really cute. So, that counts for something. >> All right, so, what's new in your world of Spark and machine learning? >> So, there's a lot of really exciting things, right. Spark 2.0.0 came out, and that's really exciting because we finally got to get rid of some of the chunkier APIs. And data sets are just becoming sort of the core base of everything going forward in Spark. This is bringing the Spark SQL engine to all sorts of places, right. So, the machine learning APIs are built on top of the data set API now. The streaming APIs are being built on top of the data set APIs. And this is starting to actually make it a lot easier for people to work together, I think. And that's one of the things that I really enjoy: when we can have people from different sorts of profiles or roles work together. And so this support of data sets being everywhere in Spark now lets people with more of like a SQL background still write stuff that's going to be used directly in sort of a production pipeline.
And the engineers can build whatever, you know, production ready stuff they need on top of the SQL expressions from the analysts and do some really cool stuff there. >> So, chunky API, what does that mean to a layperson? >> Sure, um, it means like, for example, there's this thing in Spark where one of the things you want to do is shuffle a whole bunch of data around and then look at all of the records associated with a given key, right? But, you know, when the APIs were first made, right, it was made by university students. Very smart university students, but you know, it started out as like a grad school project, right? And like, um, so finally with 2.0, we were able to get rid of things like places where we use traits like iterables rather than iterators. And because of these minor little clunky things, it's like we had to keep supporting this old API, because you can't break people's code in a minor release; but when you do a big release like Spark 2.0, you can actually go, okay, you need to change your stuff now to start using Spark 2.0. But as a result of changing that in this one place, we're actually able to better support spilling to disk. And this is for people who have too much data to fit in memory even on the individual executors. So, being able to spill to disk more effectively is really important from a performance point of view. So, there's a lot of clean up of getting rid of things which were sort of holding us back performance-wise. >> So, the value is there. Enough value to break the-- >> Yeah, enough value to break the APIs. And 1.6 will continue to be updated for people that are not ready to migrate right today. But for the people that are looking at it, it's definitely worth it, right? You get a bunch of real cool optimizations. >> One of the themes of this event of the last couple of years has been complexity.
You guys wrote an article recently in SiliconANGLE about some of the broken promises of open source, really the root of it being complexity. So, Spark addresses that to a large degree. >> I think so. >> Maybe you could talk about that and explain to us sort of how, and what the impact could be for businesses. >> So, I think Spark does a really good job of being really user-friendly, right? It has a SQL engine for people that aren't comfortable with writing, you know, Scala or Java or Python code. But then on top of that, right, there's a lot of analysts that are really familiar with Python. And Spark actually exposes Python APIs and is working on exposing R APIs. And this is making it so that if you're working on Spark, you don't have to understand the internals in a lot of depth, right? There's some other streaming systems where, to make them perform really well, you have to have a really deep mental model of what you're doing. But with Spark, it's much simpler and the APIs are cleaner, and they're exposed in the ways that people are already used to working with their data. And because it's exposed in ways that people are used to working with their data, they don't have to relearn large amounts of complexity. They just have to learn it in the few cases where they run into problems, right? Because it will work most of the time just with the sort of techniques that they're used to doing. So, I think that it's really cool. Especially structured streaming, which is new in Spark 2.0. And structured streaming makes it so that you can write sort of arbitrary SQL expressions on streaming data, which is really awesome. Like, you can do aggregations without having to sit around and think about how to effectively do an aggregation over different microbatches. That's not a problem for you to worry about. That's a problem for the Spark developers to worry about. Which, unfortunately, is sometimes a problem for me to worry about, but you know, not too often.
Boo helps out whenever it gets too stressful. >> First of all, a lot to learn. But there's been some great research done in places like Cornell and Penn and others about how the open source community collaborates and works together. And I'm wondering, is the open source community that's building things like Spark - especially in a domain like Big Data, where the use cases themselves are so complex and so important - are we starting to take some of the knowledge in the contributors, or developing, on how to collaborate and how to work together, and starting to find that way into the tools, so that the whole thing starts to collaborate better? >> Yeah, I think, actually, if you look at Spark, you can see that there's a lot of sort of tools that are being built on top of Spark, which are also being built in similar models. I mean, the Apache Software Foundation is a really good tool for managing projects of a certain scale. You can see a lot of Spark-related projects that have also decided that becoming part of the Apache Foundation is a good way to manage their governance and collaborate with different people. But then there's people that look at Spark and go, like, wow, there's a lot of overhead here. I don't think I'm going to have 500 people working on this project. I'm going to go and model my project after something a bit simpler, right? And I think that both of those are really valid ways of building open source tools on Spark. But it's really interesting seeing - there's a Spark components page, essentially a Spark packages list, for the community to publish the work that they're doing on top of Spark. And it's really interesting to see all of the collaborations that are happening there. Especially even between vendors sometimes. You'll see people make tools which help everyone's data access go faster. And it's open source, so you'll see it start to get contributed into other people's data access layers as well.
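Holden's structured-streaming point - that keeping an aggregation correct across micro-batches is the engine's problem, not the user's - can be made concrete with a toy sketch. This is plain Python, not Spark's implementation; it just shows the running state Spark maintains for you so a plain SQL-style aggregation stays right as batches arrive:

```python
class StreamingAverage:
    """Maintain a grouped running average across arriving micro-batches."""

    def __init__(self):
        self.sums = {}    # per-key running sum
        self.counts = {}  # per-key running count

    def update(self, batch):
        """Fold one micro-batch of (key, value) pairs into the state."""
        for key, value in batch:
            self.sums[key] = self.sums.get(key, 0) + value
            self.counts[key] = self.counts.get(key, 0) + 1

    def result(self):
        """Current answer to 'SELECT key, avg(value) GROUP BY key'."""
        return {k: self.sums[k] / self.counts[k] for k in self.sums}

agg = StreamingAverage()
agg.update([("sensor-a", 10), ("sensor-b", 20)])  # first micro-batch
agg.update([("sensor-a", 30)])                    # second micro-batch
print(agg.result())  # {'sensor-a': 20.0, 'sensor-b': 20.0}
```

In structured streaming you just write the aggregation query; this bookkeeping (plus fault tolerance, late data, and output modes) is handled by the engine.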
>> So, the pedagogy of how the open source community works is starting to find its way into the tools, so people who aren't in the community, but are focused on the outcomes, are now able to gain not only experience of how the big data works, but also of how people need to work on complex outcomes. >> I think that's definitely happening. And you can see that a lot with, like, the collaboration layers that different people are building on top of Spark - like the different notebook solutions, which are all very focused on enabling collaboration, right? Because if you're an analyst and you're writing some Python code on your local machine, you're not going to, like, probably set up a GitHub repo to share that with everyone, right? But if you have a notebook and you can just send the link to your friends and be like hey, what's up, can you take a look at this? You can share your results more easily and you can also work together a lot more, more collaboratively. And then so Databricks is doing some great things. IBM as well. I'm sure there's other companies building great notebook solutions who I'm forgetting. But the notebooks, I think, are really empowering people to collaborate in ways that we haven't traditionally seen in the big data space before. >> So, collaboration, to stay on that theme. So, we had eight data scientists on a panel the other night and, just talking about it, collaboration came up, and the question is specifically from an application developer standpoint. As data becomes, you know, the new development kit, how much of a data scientist do you have to become, or are you becoming, as a developer? >> Right, so, my role is very different, right? Because I focus just on tools, mostly. So, my data science is mostly to make sure that what I'm doing is actually useful to other people. Because a lot of the people that consume my stuff are data scientists. So, for me, personally, like the answer is not a whole lot.
But for a lot of my friends that are working in more traditional sort of data engineering roles, where they're empowering specific use cases, they find themselves either working really closely with data scientists - often to be like, okay, what are your requirements? What data do I need to be able to get to you so you can do your job? And, you know, sometimes if they find themselves blocking on the data scientists, they're like, how hard could it be? And it turns out, you know, statistics is actually pretty complicated. But sometimes, you know, they go ahead and pick up some of the tools on their own. And we get to see really cool things with really, really ugly graphs. 'Cause they do not know how to use graphing libraries. But, you know, it's really exciting. >> Machine learning is another big theme in this conference. Maybe you could share with us your perspectives on ML and what's happening there. >> So, I really think machine learning is very powerful. And I think machine learning in Spark is also super powerful. And especially, just like, the traditional thing is you down-sample your data. And you train a bunch of your models. And then, eventually, you're like okay, I think this is like the model that I want to build for real. And then you go and you get your engineer to help you train it on your giant data set. But Spark and the notebooks that are built on top of it actually mean that it's entirely reasonable for data scientists to take the tools which are traditionally used by the data engineering roles, and just start directly applying them during their exploration phase. And so we're seeing a lot of really more interesting models come to life, right? Because if you're always working with down-sampled data, it's okay, right? Like you can do reasonable exploration on down-sampled data. But once you're working with your full data set, you can find some really cool sort of features that you wouldn't normally find, right?
'Cause you're just not going to have that show up in your down-sampled data. And I think streaming machine learning is also a really interesting thing, right? Because we see there are a lot of IoT devices and stuff like that. And the traditional machine learning thing is: I'm going to build a model, and then I'm going to deploy it, and then maybe a week later I'll consider building a new model, and then I'll deploy it. So it looks very much like the old software release processes, as opposed to the more agile software release processes. And I think that streaming machine learning can look a lot more like the agile software development processes, where it's like: cool, I've got a bunch of labeled data from our contractors, I'm going to integrate that right away, and if I don't see any regression on my cross-validation set, we're just going to go ahead and deploy that today. And I think it's really exciting. I'm obviously a little biased, because some of my work right now is on enabling machine learning with structured streaming in Spark. So, I obviously think my work is useful, otherwise I would be doing something else. But it's entirely possible, you know, everyone will be like, Holden, your work is terrible. But I hope not. I hope people find it useful. >> Talking about sampling. At our first show, Hadoop World 2010, Abhi Mehta, he stopped by again today, of course, and he made the statement then: sampling's dead. It's dead. Is sampling dead? >> Sampling didn't quite die. I think we're getting really close to killing sampling. Sampling will only be dead once all of the data scientists in the organization have access to the same tools that the data engineers have been using, right? 'Cause otherwise you'll still be sampling. You'll still be implicitly doing your model selection on down-sampled data. And we'll still probably always find an excuse to sample data, because I'm lazy and sometimes I just want to develop on my laptop.
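The "integrate new labels right away, deploy only if cross-validation doesn't regress" loop Holden describes can be sketched with a deliberately trivial incremental model. Everything here is a hypothetical illustration of the gating pattern, not Spark structured-streaming API code.

```python
import copy

class OnlineMeanModel:
    """Trivial incremental model: predicts the running mean of labels seen so far."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
    def update(self, label):
        self.n += 1
        self.mean += (label - self.mean) / self.n
    def predict(self):
        return self.mean

def mse(model, holdout):
    """Mean squared error of the model's (constant) prediction on a holdout set."""
    return sum((y - model.predict()) ** 2 for y in holdout) / len(holdout)

def integrate_batch(live, new_labels, holdout, tolerance=1e-9):
    """Fold a fresh batch of labeled data into a copy of the live model, and
    deploy the candidate only if holdout error does not regress."""
    candidate = copy.deepcopy(live)
    for y in new_labels:
        candidate.update(y)
    if mse(candidate, holdout) <= mse(live, holdout) + tolerance:
        return candidate   # ship it today, agile-style
    return live            # regression detected: keep the old model

live = OnlineMeanModel()
for y in [1.0, 1.0, 1.0]:
    live.update(y)
holdout = [1.0, 1.0]

m1 = integrate_batch(live, [1.0], holdout)    # consistent batch: candidate goes live
m2 = integrate_batch(m1, [100.0], holdout)    # bad batch regresses on holdout: keep m1
```

The same gate, applied per micro-batch in a streaming job, is what lets model releases look like continuous delivery instead of weekly builds.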
But, you know, I think we're getting close to killing a lot more of sampling. >> Do you see an opportunity to start utilizing many of these tools to actually improve the process of building models, finding data sources, identifying individuals that need access to the data? Are we going to start turning big data on the problem of big data? >> Oh, that's really exciting. And okay, so this is something that I find really enjoyable. Traditionally, when everyone's doing their development on their laptop, you don't get to collect a lot of metrics about what they're doing, right? But once you start moving everyone into a more integrated notebook environment, you can be like, okay, these are the data sets that these different people are accessing, these are the things that I know about them. And you can actually train a recommendation algorithm on the data sets to recommend other data sets to people. And there are people that are starting to do this. And I think it's really powerful, right? In small companies it's maybe not super important, because I'll just go and ask my coworker, like, hey, what data sets should I use? But if you're at a company at Google or IBM scale, or even a 500-person company, you're not going to know all of the data sets that are available for you to work with. And the machine will actually be able to make some really interesting recommendations there. >> All right, we have to leave it there. We're out of time. Holden, thanks very much. >> Thank you so much for having me and having Boo. >> Pleasure. All right, any time. Keep right there everybody. We'll be back with our next guest. This is theCUBE. We're live from New York City. We'll be right back.
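The data-set recommendation idea Holden closes with, training on who accesses what in a shared notebook environment, can be sketched as a tiny co-occurrence recommender. The access log and all names here are invented for illustration; a production system would learn from real notebook telemetry at far larger scale.

```python
from collections import Counter

# Hypothetical notebook telemetry: which user has touched which data sets.
access_log = {
    "ana":  {"clickstream", "billing", "geo"},
    "bo":   {"clickstream", "billing"},
    "cara": {"billing", "geo"},
    "dev":  {"clickstream"},
}

def recommend(user, log, k=2):
    """Suggest data sets the user hasn't touched, scored by how often they
    co-occur with the user's data sets in similar users' histories."""
    mine = log[user]
    scores = Counter()
    for other, theirs in log.items():
        if other == user:
            continue
        overlap = len(mine & theirs)        # how similar is this other user?
        for ds in theirs - mine:
            scores[ds] += overlap           # weight their other data sets by similarity
    return [ds for ds, _ in scores.most_common(k)]
```

Calling `recommend("dev", access_log)` surfaces "billing" first, because the users most similar to dev all use it: exactly the "ask the machine instead of your coworker" behavior described above.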