Data Power Panel V3

(upbeat music) >> The stampede to cloud and massive VC investments has led to the emergence of a new generation of object store based data lakes. And with them two important trends, actually three important trends. First, a new category that combines data lakes and data warehouses aka the lakehouse is emerged as a leading contender to be the data platform of the future. And this novelty touts the ability to address data engineering, data science, and data warehouse workloads on a single shared data platform. The other major trend we've seen is query engines and broader data fabric virtualization platforms have embraced NextGen data lakes as platforms for SQL centric business intelligence workloads, reducing, or somebody even claim eliminating the need for separate data warehouses. Pretty bold. However, cloud data warehouses have added complimentary technologies to bridge the gaps with lakehouses. And the third is many, if not most customers that are embracing the so-called data fabric or data mesh architectures. They're looking at data lakes as a fundamental component of their strategies, and they're trying to evolve them to be more capable, hence the interest in lakehouse, but at the same time, they don't want to, or can't abandon their data warehouse estate. As such we see a battle royale is brewing between cloud data warehouses and cloud lakehouses. Is it possible to do it all with one cloud center analytical data platform? Well, we're going to find out. My name is Dave Vellante and welcome to the data platform's power panel on theCUBE. Our next episode in a series where we gather some of the industry's top analysts to talk about one of our favorite topics, data. In today's session, we'll discuss trends, emerging options, and the trade offs of various approaches and we'll name names. Joining us today are Sanjeev Mohan, who's the principal at SanjMo, Tony Baers, principal at dbInsight. And Doug Henschen is the vice president and principal analyst at Constellation Research. Guys, welcome back to theCUBE. Great to see you again. >> Thank guys. Thank you. >> Thank you. >> So it's early June and we're gearing up with two major conferences, there's several database conferences, but two in particular that were very interested in, Snowflake Summit and Databricks Data and AI Summit. Doug let's start off with you and then Tony and Sanjeev, if you could kindly weigh in. Where did this all start, Doug? The notion of lakehouse. And let's talk about what exactly we mean by lakehouse. Go ahead. >> Yeah, well you nailed it in your intro. One platform to address BI data science, data engineering, fewer platforms, less cost, less complexity, very compelling. You can credit Databricks for coining the term lakehouse back in 2020, but it's really a much older idea. You can go back to Cloudera introducing their Impala database in 2012. That was a database on top of Hadoop. And indeed in that last decade, by the middle of that last decade, there were several SQL on Hadoop products, open standards like Apache Drill. And at the same time, the database vendors were trying to respond to this interest in machine learning and the data science. So they were adding SQL extensions, the likes Hudi and Vertical we're adding SQL extensions to support the data science. But then later in that decade with the shift to cloud and object storage, you saw the vendor shift to this whole cloud, and object storage idea. So you have in the database camp Snowflake introduce Snowpark to try to address the data science needs. They introduced that in 2020 and last year they announced support for Python. You also had Oracle, SAP jumped on this lakehouse idea last year, supporting both the lake and warehouse single vendor, not necessarily quite single platform. Google very recently also jumped on the bandwagon. And then you also mentioned, the SQL engine camp, the Dremios, the Ahanas, the Starbursts, really doing two things, a fabric for distributed access to many data sources, but also very firmly planning that idea that you can just have the lake and we'll help you do the BI workloads on that. And then of course, the data lake camp with the Databricks and Clouderas providing a warehouse style deployments on top of their lake platforms. >> Okay, thanks, Doug. I'd be remiss those of you who me know that I typically write my own intros. This time my colleagues fed me a lot of that material. So thank you. You guys make it easy. But Tony, give us your thoughts on this intro. >> Right. Well, I very much agree with both of you, which may not make for the most exciting television in terms of that it has been an evolution just like Doug said. I mean, for instance, just to give an example when Teradata bought AfterData was initially seen as a hardware platform play. In the end, it was basically, it was all those after functions that made a lot of sort of big data analytics accessible to SQL. (clears throat) And so what I really see just in a more simpler definition or functional definition, the data lakehouse is really an attempt by the data lake folks to make the data lake friendlier territory to the SQL folks, and also to get into friendly territory, to all the data stewards, who are basically concerned about the sprawl and the lack of control in governance in the data lake. So it's really kind of a continuing of an ongoing trend that being said, there's no action without counter action. And of course, at the other end of the spectrum, we also see a lot of the data warehouses starting to edit things like in database machine learning. So they're certainly not surrendering without a fight. Again, as Doug was mentioning, this has been part of a continual blending of platforms that we've seen over the years that we first saw in the Hadoop years with SQL on Hadoop and data warehouses starting to reach out to cloud storage or should say the HDFS and then with the cloud then going cloud native and therefore trying to break the silos down even further. >> Now, thank you. And Sanjeev, data lakes, when we first heard about them, there were such a compelling name, and then we realized all the problems associated with them. So pick it up from there. What would you add to Doug and Tony? >> I would say, these are excellent points that Doug and Tony have brought to light. The concept of lakehouse was going on to your point, Dave, a long time ago, long before the tone was invented. For example, in Uber, Uber was trying to do a mix of Hadoop and Vertical because what they really needed were transactional capabilities that Hadoop did not have. So they weren't calling it the lakehouse, they were using multiple technologies, but now they're able to collapse it into a single data store that we call lakehouse. Data lakes, excellent at batch processing large volumes of data, but they don't have the real time capabilities such as change data capture, doing inserts and updates. So this is why lakehouse has become so important because they give us these transactional capabilities. >> Great. So I'm interested, the name is great, lakehouse. The concept is powerful, but I get concerned that it's a lot of marketing hype behind it. So I want to examine that a bit deeper. How mature is the concept of lakehouse? Are there practical examples that really exist in the real world that are driving business results for practitioners? Tony, maybe you could kick that off. >> Well, put it this way. I think what's interesting is that both data lakes and data warehouse that each had to extend themselves. To believe the Databricks hype it's that this was just a natural extension of the data lake. In point of fact, Databricks had to go outside its core technology of Spark to make the lakehouse possible. And it's a very similar type of thing on the part with data warehouse folks, in terms of that they've had to go beyond SQL, In the case of Databricks. There have been a number of incremental improvements to Delta lake, to basically make the table format more performative, for instance. But the other thing, I think the most dramatic change in all that is in their SQL engine and they had to essentially pretty much abandon Spark SQL because it really, in off itself Spark SQL is essentially stop gap solution. And if they wanted to really address that crowd, they had to totally reinvent SQL or at least their SQL engine. And so Databricks SQL is not Spark SQL, it is not Spark, it's basically SQL that it's adapted to run in a Spark environment, but the underlying engine is C++, it's not scale or anything like that. So Databricks had to take a major detour outside of its core platform to do this. So to answer your question, this is not mature because these are all basically kind of, even though the idea of blending platforms has been going on for well over a decade, I would say that the current iteration is still fairly immature. And in the cloud, I could see a further evolution of this because if you think through cloud native architecture where you're essentially abstracting compute from data, there is no reason why, if let's say you are dealing with say, the same basically data targets say cloud storage, cloud object storage that you might not apportion the task to different compute engines. And so therefore you could have, for instance, let's say you're Google, you could have BigQuery, perform basically the types of the analytics, the SQL analytics that would be associated with the data warehouse and you could have BigQuery ML that does some in database machine learning, but at the same time for another part of the query, which might involve, let's say some deep learning, just for example, you might go out to let's say the serverless spark service or the data proc. And there's no reason why Google could not blend all those into a coherent offering that's basically all triggered through microservices. And I just gave Google as an example, if you could generalize that with all the other cloud or all the other third party vendors. So I think we're still very early in the game in terms of maturity of data lakehouses. >> Thanks, Tony. So Sanjeev, is this all hype? What are your thoughts? >> It's not hype, but completely agree. It's not mature yet. Lakehouses have still a lot of work to do, so what I'm now starting to see is that the world is dividing into two camps. On one hand, there are people who don't want to deal with the operational aspects of vast amounts of data. They are the ones who are going for BigQuery, Redshift, Snowflake, Synapse, and so on because they want the platform to handle all the data modeling, access control, performance enhancements, but these are trade off. If you go with these platforms, then you are giving up on vendor neutrality. On the other side are those who have engineering skills. They want the independence. In other words, they don't want vendor lock in. They want to transform their data into any number of use cases, especially data science, machine learning use case. What they want is agility via open file formats using any compute engine. So why do I say lakehouses are not mature? Well, cloud data warehouses they provide you an excellent user experience. That is the main reason why Snowflake took off. If you have thousands of cables, it takes minutes to get them started, uploaded into your warehouse and start experimentation. Table formats are far more resonating with the community than file formats. But once the cost goes up of cloud data warehouse, then the organization start exploring lakehouses. But the problem is lakehouses still need to do a lot of work on metadata. Apache Hive was a fantastic first attempt at it. Even today Apache Hive is still very strong, but it's all technical metadata and it has so many different restrictions. That's why we see Databricks is investing into something called Unity Catalog. Hopefully we'll hear more about Unity Catalog at the end of the month. But there's a second problem. I just want to mention, and that is lack of standards. All these open source vendors, they're running, what I call ego projects. You see on LinkedIn, they're constantly battling with each other, but end user doesn't care. End user wants a problem to be solved. They want to use Trino, Dremio, Spark from EMR, Databricks, Ahana, DaaS, Frink, Athena. But the problem is that we don't have common standards. >> Right. Thanks. So Doug, I worry sometimes. I mean, I look at the space, we've debated for years, best of breed versus the full suite. You see AWS with whatever, 12 different plus data stores and different APIs and primitives. You got Oracle putting everything into its database. It's actually done some interesting things with MySQL HeatWave, so maybe there's proof points there, but Snowflake really good at data warehouse, simplifying data warehouse. Databricks, really good at making lakehouses actually more functional. Can one platform do it all? >> Well in a word, I can't be best at breed at all things. I think the upshot of and cogen analysis from Sanjeev there, the database, the vendors coming out of the database tradition, they excel at the SQL. They're extending it into data science, but when it comes to unstructured data, data science, ML AI often a compromise, the data lake crowd, the Databricks and such. They've struggled to completely displace the data warehouse when it really gets to the tough SLAs, they acknowledge that there's still a role for the warehouse. Maybe you can size down the warehouse and offload some of the BI workloads and maybe and some of these SQL engines, good for ad hoc, minimize data movement. But really when you get to the deep service level, a requirement, the high concurrency, the high query workloads, you end up creating something that's warehouse like. >> Where do you guys think this market is headed? What's going to take hold? Which projects are going to fade away? You got some things in Apache projects like Hudi and Iceberg, where do they fit Sanjeev? Do you have any thoughts on that? >> So thank you, Dave. So I feel that table formats are starting to mature. There is a lot of work that's being done. We will not have a single product or single platform. We'll have a mixture. So I see a lot of Apache Iceberg in the news. Apache Iceberg is really innovating. Their focus is on a table format, but then Delta and Apache Hudi are doing a lot of deep engineering work. For example, how do you handle high concurrency when there are multiple rights going on? Do you version your Parquet files or how do you do your upcerts basically? So different focus, at the end of the day, the end user will decide what is the right platform, but we are going to have multiple formats living with us for a long time. >> Doug is Iceberg in your view, something that's going to address some of those gaps in standards that Sanjeev was talking about earlier? >> Yeah, Delta lake, Hudi, Iceberg, they all address this need for consistency and scalability, Delta lake open technically, but open for access. I don't hear about Delta lakes in any worlds, but Databricks, hearing a lot of buzz about Apache Iceberg. End users want an open performance standard. And most recently Google embraced Iceberg for its recent a big lake, their stab at having supporting both lakes and warehouses on one conjoined platform. >> And Tony, of course, you remember the early days of the sort of big data movement you had MapR was the most closed. You had Horton works the most open. You had Cloudera in between. There was always this kind of contest as to who's the most open. Does that matter? Are we going to see a repeat of that here? >> I think it's spheres of influence, I think, and Doug very much was kind of referring to this. I would call it kind of like the MongoDB syndrome, which is that you have... and I'm talking about MongoDB before they changed their license, open source project, but very much associated with MongoDB, which basically, pretty much controlled most of the contributions made decisions. And I think Databricks has the same iron cloud hold on Delta lake, but still the market is pretty much associated Delta lake as the Databricks, open source project. I mean, Iceberg is probably further advanced than Hudi in terms of mind share. And so what I see that's breaking down to is essentially, basically the Databricks open source versus the everything else open source, the community open source. So I see it's a very similar type of breakdown that I see repeating itself here. >> So by the way, Mongo has a conference next week, another data platform is kind of not really relevant to this discussion totally. But in the sense it is because there's a lot of discussion on earnings calls these last couple of weeks about consumption and who's exposed, obviously people are concerned about Snowflake's consumption model. Mongo is maybe less exposed because Atlas is prominent in the portfolio, blah, blah, blah. But I wanted to bring up the little bit of controversy that we saw come out of the Snowflake earnings call, where the ever core analyst asked Frank Klutman about discretionary spend. And Frank basically said, look, we're not discretionary. We are deeply operationalized. Whereas he kind of poo-pooed the lakehouse or the data lake, et cetera, saying, oh yeah, data scientists will pull files out and play with them. That's really not our business. Do any of you have comments on that? Help us swing through that controversy. Who wants to take that one? >> Let's put it this way. The SQL folks are from Venus and the data scientists are from Mars. So it means it really comes down to it, sort that type of perception. The fact is, is that, traditionally with analytics, it was very SQL oriented and that basically the quants were kind of off in their corner, where they're using SaaS or where they're using Teradata. It's really a great leveler today, which is that, I mean basic Python it's become arguably one of the most popular programming languages, depending on what month you're looking at, at the title index. And of course, obviously SQL is, as I tell the MongoDB folks, SQL is not going away. You have a large skills base out there. And so basically I see this breaking down to essentially, you're going to have each group that's going to have its own natural preferences for its home turf. And the fact that basically, let's say the Python and scale of folks are using Databricks does not make them any less operational or machine critical than the SQL folks. >> Anybody else want to chime in on that one? >> Yeah, I totally agree with that. Python support in Snowflake is very nascent with all of Snowpark, all of the things outside of SQL, they're very much relying on partners too and make things possible and make data science possible. And it's very early days. I think the bottom line, what we're going to see is each of these camps is going to keep working on doing better at the thing that they don't do today, or they're new to, but they're not going to nail it. They're not going to be best of breed on both sides. So the SQL centric companies and shops are going to do more data science on their database centric platform. That data science driven companies might be doing more BI on their leagues with those vendors and the companies that have highly distributed data, they're going to add fabrics, and maybe offload more of their BI onto those engines, like Dremio and Starburst. >> So I've asked you this before, but I'll ask you Sanjeev. 'Cause Snowflake and Databricks are such great examples 'cause you have the data engineering crowd trying to go into data warehousing and you have the data warehousing guys trying to go into the lake territory. Snowflake has $5 billion in the balance sheet and I've asked you before, I ask you again, doesn't there has to be a semantic layer between these two worlds? Does Snowflake go out and do M&A and maybe buy ad scale or a data mirror? Or is that just sort of a bandaid? What are your thoughts on that Sanjeev? >> I think semantic layer is the metadata. The business metadata is extremely important. At the end of the day, the business folks, they'd rather go to the business metadata than have to figure out, for example, like let's say, I want to update somebody's email address and we have a lot of overhead with data residency laws and all that. I want my platform to give me the business metadata so I can write my business logic without having to worry about which database, which location. So having that semantic layer is extremely important. In fact, now we are taking it to the next level. Now we are saying that it's not just a semantic layer, it's all my KPIs, all my calculations. So how can I make those calculations independent of the compute engine, independent of the BI tool and make them fungible. So more disaggregation of the stack, but it gives us more best of breed products that the customers have to worry about. >> So I want to ask you about the stack, the modern data stack, if you will. And we always talk about injecting machine intelligence, AI into applications, making them more data driven. But when you look at the application development stack, it's separate, the database is tends to be separate from the data and analytics stack. Do those two worlds have to come together in the modern data world? And what does that look like organizationally? >> So organizationally even technically I think it is starting to happen. Microservices architecture was a first attempt to bring the application and the data world together, but they are fundamentally different things. For example, if an application crashes, that's horrible, but Kubernetes will self heal and it'll bring the application back up. But if a database crashes and corrupts your data, we have a huge problem. So that's why they have traditionally been two different stacks. They are starting to come together, especially with data ops, for instance, versioning of the way we write business logic. It used to be, a business logic was highly embedded into our database of choice, but now we are disaggregating that using GitHub, CICD the whole DevOps tool chain. So data is catching up to the way applications are. >> We also have databases, that trans analytical databases that's a little bit of what the story is with MongoDB next week with adding more analytical capabilities. But I think companies that talk about that are always careful to couch it as operational analytics, not the warehouse level workloads. So we're making progress, but I think there's always going to be, or there will long be a separate analytical data platform. >> Until data mesh takes over. (all laughing) Not opening a can of worms. >> Well, but wait, I know it's out of scope here, but wouldn't data mesh say, hey, do take your best of breed to Doug's earlier point. You can't be best of breed at everything, wouldn't data mesh advocate, data lakes do your data lake thing, data warehouse, do your data lake, then you're just a node on the mesh. (Tony laughs) Now you need separate data stores and you need separate teams. >> To my point. >> I think, I mean, put it this way. (laughs) Data mesh itself is a logical view of the world. The data mesh is not necessarily on the lake or on the warehouse. I think for me, the fear there is more in terms of, the silos of governance that could happen and the silo views of the world, how we redefine. And that's why and I want to go back to something what Sanjeev said, which is that it's going to be raising the importance of the semantic layer. Now does Snowflake that opens a couple of Pandora's boxes here, which is one, does Snowflake dare go into that space or do they risk basically alienating basically their partner ecosystem, which is a key part of their whole appeal, which is best of breed. They're kind of the same situation that Informatica was where in the early 2000s, when Informatica briefly flirted with analytic applications and realized that was not a good idea, need to redouble down on their core, which was data integration. The other thing though, that raises the importance of and this is where the best of breed comes in, is the data fabric. My contention is that and whether you use employee data mesh practice or not, if you do employee data mesh, you need data fabric. If you deploy data fabric, you don't necessarily need to practice data mesh. But data fabric at its core and admittedly it's a category that's still very poorly defined and evolving, but at its core, we're talking about a common meta data back plane, something that we used to talk about with master data management, this would be something that would be more what I would say basically, mutable, that would be more evolving, basically using, let's say, machine learning to kind of, so that we don't have to predefine rules or predefine what the world looks like. But so I think in the long run, what this really means is that whichever way we implement on whichever physical platform we implement, we need to all be speaking the same metadata language. And I think at the end of the day, regardless of whether it's a lake, warehouse or a lakehouse, we need common metadata. >> Doug, can I come back to something you pointed out? That those talking about bringing analytic and transaction databases together, you had talked about operationalizing those and the caution there. Educate me on MySQL HeatWave. I was surprised when Oracle put so much effort in that, and you may or may not be familiar with it, but a lot of folks have talked about that. Now it's got nowhere in the market, that no market share, but a lot of we've seen these benchmarks from Oracle. How real is that bringing together those two worlds and eliminating ETL? >> Yeah, I have to defer on that one. That's my colleague, Holger Mueller. He wrote the report on that. He's way deep on it and I'm not going to mock him. >> I wonder if that is something, how real that is or if it's just Oracle marketing, anybody have any thoughts on that? >> I'm pretty familiar with HeatWave. It's essentially Oracle doing what, I mean, there's kind of a parallel with what Google's doing with AlloyDB. It's an operational database that will have some embedded analytics. And it's also something which I expect to start seeing with MongoDB. And I think basically, Doug and Sanjeev were kind of referring to this before about basically kind of like the operational analytics, that are basically embedded within an operational database. The idea here is that the last thing you want to do with an operational database is slow it down. So you're not going to be doing very complex deep learning or anything like that, but you might be doing things like classification, you might be doing some predictives. In other words, we've just concluded a transaction with this customer, but was it less than what we were expecting? What does that mean in terms of, is this customer likely to turn? I think we're going to be seeing a lot of that. And I think that's what a lot of what MySQL HeatWave is all about. Whether Oracle has any presence in the market now it's still a pretty new announcement, but the other thing that kind of goes against Oracle, (laughs) that they had to battle against is that even though they own MySQL and run the open source project, everybody else, in terms of the actual commercial implementation it's associated with everybody else. And the popular perception has been that MySQL has been basically kind of like a sidelight for Oracle. And so it's on Oracles shoulders to prove that they're damn serious about it. >> There's no coincidence that MariaDB was launched the day that Oracle acquired Sun. Sanjeev, I wonder if we could come back to a topic that we discussed earlier, which is this notion of consumption, obviously Wall Street's very concerned about it. Snowflake dropped prices last week. I've always felt like, hey, the consumption model is the right model. I can dial it down in when I need to, of course, the street freaks out. What are your thoughts on just pricing, the consumption model? What's the right model for companies, for customers? >> Consumption model is here to stay. What I would like to see, and I think is an ideal situation and actually plays into the lakehouse concept is that, I have my data in some open format, maybe it's Parquet or CSV or JSON, Avro, and I can bring whatever engine is the best engine for my workloads, bring it on, pay for consumption, and then shut it down. And by the way, that could be Cloudera. We don't talk about Cloudera very much, but it could be one business unit wants to use Athena. Another business unit wants to use some other Trino let's say or Dremio. So every business unit is working on the same data set, see that's critical, but that data set is maybe in their VPC and they bring any compute engine, you pay for the use, shut it down. That then you're getting value and you're only paying for consumption. It's not like, I left a cluster running by mistake, so there have to be guardrails. The reason FinOps is so big is because it's very easy for me to run a Cartesian joint in the cloud and get a $10,000 bill. >> This looks like it's been a sort of a victim of its own success in some ways, they made it so easy to spin up single note instances, multi note instances. And back in the day when compute was scarce and costly, those database engines optimized every last bit so they could get as much workload as possible out of every instance. Today, it's really easy to spin up a new node, a new multi node cluster. So that freedom has meant many more nodes that aren't necessarily getting that utilization. So Snowflake has been doing a lot to add reporting, monitoring, dashboards around the utilization of all the nodes and multi node instances that have spun up. And meanwhile, we're seeing some of the traditional on-prem databases that are moving into the cloud, trying to offer that freedom. And I think they're going to have that same discovery that the cost surprises are going to follow as they make it easy to spin up new instances. >> Yeah, a lot of money went into this market over the last decade, separating compute from storage, moving to the cloud. I'm glad you mentioned Cloudera Sanjeev, 'cause they got it all started, the kind of big data movement. We don't talk about them that much. Sometimes I wonder if it's because when they merged Hortonworks and Cloudera, they dead ended both platforms, but then they did invest in a more modern platform. But what's the future of Cloudera? What are you seeing out there? >> Cloudera has a good product. I have to say the problem in our space is that there're way too many companies, there's way too much noise. We are expecting the end users to parse it out or we expecting analyst firms to boil it down. So I think marketing becomes a big problem. As far as technology is concerned, I think Cloudera did turn their selves around and Tony, I know you, you talked to them quite frequently. I think they have quite a comprehensive offering for a long time actually. They've created Kudu, so they got operational, they have Hadoop, they have an operational data warehouse, they're migrated to the cloud. They are in hybrid multi-cloud environment. Lot of cloud data warehouses are not hybrid. They're only in the cloud. >> Right. I think what Cloudera has done the most successful has been in the transition to the cloud and the fact that they're giving their customers more OnRamps to it, more hybrid OnRamps. So I give them a lot of credit there. They're also have been trying to position themselves as being the most price friendly in terms of that we will put more guardrails and governors on it. I mean, part of that could be spin. But on the other hand, they don't have the same vested interest in compute cycles as say, AWS would have with EMR. That being said, yes, Cloudera does it, I think its most powerful appeal so of that, it almost sounds in a way, I don't want to cast them as a legacy system. But the fact is they do have a huge landed legacy on-prem and still significant potential to land and expand that to the cloud. That being said, even though Cloudera is multifunction, I think it certainly has its strengths and weaknesses. And the fact this is that yes, Cloudera has an operational database or an operational data store with a kind of like the outgrowth of age base, but Cloudera is still based, primarily known for the deep analytics, the operational database nobody's going to buy Cloudera or Cloudera data platform strictly for the operational database. They may use it as an add-on, just in the same way that a lot of customers have used let's say Teradata basically to do some machine learning or let's say, Snowflake to parse through JSON. Again, it's not an indictment or anything like that, but the fact is obviously they do have their strengths and their weaknesses. I think their greatest opportunity is with their existing base because that base has a lot invested and vested. And the fact is they do have a hybrid path that a lot of the others lack. >> And of course being on the quarterly shock clock was not a good place to be under the microscope for Cloudera and now they at least can refactor the business accordingly. I'm glad you mentioned hybrid too. We saw Snowflake last month, did a deal with Dell whereby non-native Snowflake data could access on-prem object store from Dell. They announced a similar thing with pure storage. What do you guys make of that? Is that just... How significant will that be? Will customers actually do that? I think they're using either materialized views or extended tables. >> There are data rated and residency requirements. There are desires to have these platforms in your own data center. And finally they capitulated, I mean, Frank Klutman is famous for saying to be very focused and earlier, not many months ago, they called the going on-prem as a distraction, but clearly there's enough demand and certainly government contracts any company that has data residency requirements, it's a real need. So they finally addressed it. >> Yeah, I'll bet dollars to donuts, there was an EBC session and some big customer said, if you don't do this, we ain't doing business with you. And that was like, okay, we'll do it. >> So Dave, I have to say, earlier on you had brought this point, how Frank Klutman was poo-pooing data science workloads. On your show, about a year or so ago, he said, we are never going to on-prem. He burnt that bridge. (Tony laughs) That was on your show. >> I remember exactly the statement because it was interesting. He said, we're never going to do the halfway house. And I think what he meant is we're not going to bring the Snowflake architecture to run on-prem because it defeats the elasticity of the cloud. So this was kind of a capitulation in a way. But I think it still preserves his original intent sort of, I don't know. >> The point here is that every vendor will poo-poo whatever they don't have until they do have it. >> Yes. >> And then it'd be like, oh, we are all in, we've always been doing this. We have always supported this and now we are doing it better than others. >> Look, it was the same type of shock wave that we felt basically when AWS at the last moment at one of their reinvents, oh, by the way, we're going to introduce outposts. And the analyst group is typically pre briefed about a week or two ahead under NDA and that was not part of it. And when they dropped, they just casually dropped that in the analyst session. It's like, you could have heard the sound of lots of analysts changing their diapers at that point. >> (laughs) I remember that. And a props to Andy Jassy who once, many times actually told us, never say never when it comes to AWS. So guys, I know we got to run. We got some hard stops. Maybe you could each give us your final thoughts, Doug start us off and then-- >> Sure. Well, we've got the Snowflake Summit coming up. I'll be looking for customers that are really doing data science, that are really employing Python through Snowflake, through Snowpark. And then a couple weeks later, we've got Databricks with their Data and AI Summit in San Francisco. I'll be looking for customers that are really doing considerable BI workloads. Last year I did a market overview of this analytical data platform space, 14 vendors, eight of them claim to support lakehouse, both sides of the camp, Databricks customer had 32, their top customer that they could site was unnamed. It had 32 concurrent users doing 15,000 queries per hour. That's good but it's not up to the most demanding BI SQL workloads. And they acknowledged that and said, they need to keep working that. Snowflake asked for their biggest data science customer, they cited Kabura, 400 terabytes, 8,500 users, 400,000 data engineering jobs per day. I took the data engineering job to be probably SQL centric, ETL style transformation work. So I want to see the real use of the Python, how much Snowpark has grown as a way to support data science. >> Great. Tony. >> Actually of all things. And certainly, I'll also be looking for similar things in what Doug is saying, but I think sort of like, kind of out of left field, I'm interested to see what MongoDB is going to start to say about operational analytics, 'cause I mean, they're into this conquer the world strategy. We can be all things to all people. Okay, if that's the case, what's going to be a case with basically, putting in some inline analytics, what are you going to be doing with your query engine? So that's actually kind of an interesting thing we're looking for next week. >> Great. Sanjeev. >> So I'll be at MongoDB world, Snowflake and Databricks and very interested in seeing, but since Tony brought up MongoDB, I see that even the databases are shifting tremendously. They are addressing both the hashtag use case online, transactional and analytical. I'm also seeing that these databases started in, let's say in case of MySQL HeatWave, as relational or in MongoDB as document, but now they've added graph, they've added time series, they've added geospatial and they just keep adding more and more data structures and really making these databases multifunctional. So very interesting. >> It gets back to our discussion of best of breed, versus all in one. And it's likely Mongo's path or part of their strategy of course, is through developers. They're very developer focused. So we'll be looking for that. And guys, I'll be there as well. I'm hoping that we maybe have some extra time on theCUBE, so please stop by and we can maybe chat a little bit. Guys as always, fantastic. Thank you so much, Doug, Tony, Sanjeev, and let's do this again. >> It's been a pleasure. >> All right and thank you for watching. This is Dave Vellante for theCUBE and the excellent analyst. We'll see you next time. (upbeat music)

Published Date : Jun 2 2022

SUMMARY :

And Doug Henschen is the vice president Thank you. Doug let's start off with you And at the same time, me a lot of that material. And of course, at the and then we realized all the and Tony have brought to light. So I'm interested, the And in the cloud, So Sanjeev, is this all hype? But the problem is that we I mean, I look at the space, and offload some of the So different focus, at the end of the day, and warehouses on one conjoined platform. of the sort of big data movement most of the contributions made decisions. Whereas he kind of poo-pooed the lakehouse and the data scientists are from Mars. and the companies that have in the balance sheet that the customers have to worry about. the modern data stack, if you will. and the data world together, the story is with MongoDB Until data mesh takes over. and you need separate teams. that raises the importance of and the caution there. Yeah, I have to defer on that one. The idea here is that the of course, the street freaks out. and actually plays into the And back in the day when the kind of big data movement. We are expecting the end And the fact is they do have a hybrid path refactor the business accordingly. saying to be very focused And that was like, okay, we'll do it. So Dave, I have to say, the Snowflake architecture to run on-prem The point here is that and now we are doing that in the analyst session. And a props to Andy Jassy and said, they need to keep working that. Great. Okay, if that's the case, Great. I see that even the databases I'm hoping that we maybe have and the excellent analyst.

ENTITIES

Entity	Category	Confidence
Doug	PERSON	0.99+
Dave Vellante	PERSON	0.99+
Dave	PERSON	0.99+
Tony	PERSON	0.99+
Uber	ORGANIZATION	0.99+
Frank	PERSON	0.99+
Frank Klutman	PERSON	0.99+
Tony Baers	PERSON	0.99+
Mars	LOCATION	0.99+
Doug Henschen	PERSON	0.99+
2020	DATE	0.99+
AWS	ORGANIZATION	0.99+
Venus	LOCATION	0.99+
Oracle	ORGANIZATION	0.99+
2012	DATE	0.99+
Databricks	ORGANIZATION	0.99+
Dell	ORGANIZATION	0.99+
Hortonworks	ORGANIZATION	0.99+
Holger Mueller	PERSON	0.99+
Andy Jassy	PERSON	0.99+
last year	DATE	0.99+
$5 billion	QUANTITY	0.99+
$10,000	QUANTITY	0.99+
14 vendors	QUANTITY	0.99+
Last year	DATE	0.99+
last week	DATE	0.99+
San Francisco	LOCATION	0.99+
SanjMo	ORGANIZATION	0.99+
Google	ORGANIZATION	0.99+
8,500 users	QUANTITY	0.99+
Sanjeev	PERSON	0.99+
Informatica	ORGANIZATION	0.99+
32 concurrent users	QUANTITY	0.99+
two	QUANTITY	0.99+
Constellation Research	ORGANIZATION	0.99+
Mongo	ORGANIZATION	0.99+
Sanjeev Mohan	PERSON	0.99+
Ahana	ORGANIZATION	0.99+
DaaS	ORGANIZATION	0.99+
EMR	ORGANIZATION	0.99+
32	QUANTITY	0.99+
Atlas	ORGANIZATION	0.99+
Delta	ORGANIZATION	0.99+
Snowflake	ORGANIZATION	0.99+
Python	TITLE	0.99+
each	QUANTITY	0.99+
Athena	ORGANIZATION	0.99+
next week	DATE	0.99+

Sean Knapp, Ascend.io & Jason Robinson, Steady | AWS Startup Showcase

(upbeat music) >> Hello and welcome to today's session, theCUBE's presentation of the AWS Startup Showcase, New Breakthroughs in DevOps, Data Analytics, Cloud Management Tools, featuring Ascend.io for the data and analytics track. I'm your host, John Furrier with theCUBE. Today, we're proud joined by Sean Knapp, CEO and founder of Ascend.io and Jason Robinson who's the VP of Data Science and Engineering at Steady. Guys, thanks for coming on and congratulations, Sean, for the continued success, loves our cube conversation and Jason, nice to meet you. >> Great to meet you. >> Thanks for having us. >> So, the session today is really kind of looking at automating analytics workloads, right? So, and Steady as a customer. Sean, talk about the relationship with the customer Steady. What's the main product, what's the core relationship? >> Yeah, it's a really great question. when we work with a lot of companies like Steady we're working hand in hand with their data engineering teams, to help them onboard onto the Ascend platform, build these really powerful data pipelines, fueling their analytics and other workloads, and really helping to ensure that they can be successful at getting more leverage and building faster than ever before. So we tend to partner really closely with each other's teams and really think of them even as extensions of each other's own teams. I watch in slack oftentimes and our teams just go back and forth. And it's like, as if we were all just part of the same company. >> It's a really exciting time, Jason, great to have you on as a person cutting your teeth into this kind of what I call next gen data as intellectual property. Sean and I chat on theCUBE conversation previous to this event where every company is a data company, right? And we've heard that cliche. >> Right. >> But it's true, right? It's going to, it's getting more powerful with the edge. You seeing more diverse data, faster data, small, big, large, medium, all kinds of different aspects and patterns. And it's becoming a workflow kind of intellectual property paradigm for companies, not so much. >> That's right. >> Just the tech it's the database is you can, it's the data itself, data in flight, it's moving around, it's got value. What's your take-- >> Absolutely. >> On this trend? >> Basically, Steady helps our members and we have a community of members earn more income. So we want to help them steady their financial lives. And that's all based on data, so we have a web app, you could go to the iOS Store, you could go to the Google Play Store, you can download the app. And we have a large number of members, 3 million plus, who are actively using this. And we also have a very exciting new product called income passport. And this helps 1099 and mixed wage earners verify their income, which is very important for different government benefits. And then third, we help people with emergency cash grants as well as awards. So all of that is built on a bedrock of data, so if you're using our apps, it's all data powered. So what you were mentioning earlier from pipelines that are running it real time to yeah, anything, that's a kind of a small data aggregation, we do everything from small to real-time and large. >> You guys are like a multiple sided marketplace here, you've got it, you're a FinTech app, as well as the future of work and with virtual space-- >> That's right. >> Happening now, this is becoming, actually encapsulates kind of the critical problems that people trying to solve right now, you've got multiple stakeholders. >> That's right. >> In the data. >> Yes, we absolutely do. So we have our members, but we also, within the company, we have product, we have strategy, we have a growth team, we have operations. So data engineering and data science also work with a data analytics organization. So at Steady we're very much a data company. And we have a data organization led by our chief data officer and we have data engineering and data science, which are my teams, but also that business insights and analytics. So a lot of what we're building on the data engineering side is powering those insights and analytics that the business stakeholders use every day to run the organization. >> Sean, I want to get your thoughts on this because we heard from Emily Freeman in the keynote about how this revolution in DevOps or for premiering her talk around how, it's not just one persona anymore, I'm a release engineer, I'm this kind of engineer, you're seeing now all engineering, all developers are developers. You have some specialty, but for the most part, the team makeups are changing. We touched on this in our cube conversation. The journey of data is not just the data people, the data folks. It's like there's, they're developers too. So the confluence of data science, data management, developing, is changing the team and cultural makeup of companies. Could you share your thoughts on this dynamic and how it impacts customers? >> Absolutely, I think the, we're finding a similar trend to what we saw a number of years ago, when we talked about how software was eating the world and every company was now becoming a software company. And as a result, we saw this proliferation and expansion of what the software roles look like and thought of a company pulled through this entire new era of DevOps. We were finding that same pattern now emerging around data as not only is every company a software company, every company is a data company and data really is that field, that oil that fuels the business and in doing so, we're finding that as Jason describes it's pervasive across the team, it is no longer just one team that is creating some insights and reports around operational analytics, or maybe a team over here doing data science or machine learning. It is expensive. And I think the really interesting challenges that start to come with this too, are so many data teams are so over capacity. We did a recent study that highlighted that 96% of data teams are at, or over capacity, only 4% had spare capacity. But as a result, the net is being cast even wider to pull in people from even broader and more adjacent domains to all participate in the data future of their organization. >> Yeah, and I think I'd love to get your guys react to this conversation with Andy Jassy, who's now the CEO of Amazon, but when he was the CEO of AWS last year, I talked with him about how the old guard and new guard are thinking around team formations. Obviously team capacity is growing and challenged when you've got the right formula. So that's one thing, right? But what if you don't have the right formula? If you're in the skills gap, problem, or team formation side of it, where you maybe there was two years ago where the mandate came down? Well, we got to build a data team even in two years, if you're not inquisitive. And this is what Andy and I were talking about is the thinking and the mindset of that mission and being open to discovering and understanding the changes, because if you were deciding what your team was two, three years ago, that might have changed a lot. So team capacity, Sean, to your point, if you got it right, and that's a challenge in and of itself, but what if you don't have it, right? What do you guys think about this? >> Yeah, I think that's exactly right. Basically trying to see, look and gaze into the crystal ball and see what's going to happen in a year or two years, even six months is quite difficult. And if you don't have it right, you do spend a lot of time because of the technical debt that you've amassed. And we certainly spend quite a bit of time with technical debt for things we wanted to build. So, deconvolving that, getting those ETLs to a runnable state, getting performance there, that's what we spend a bit of time on. And yeah, it's something that it's really part of the package. >> What do you guys see as the big challenge on teams? The scaling challenge okay. Formation is one thing, Sean, but like, okay, getting it right, getting it formed properly and then scaling it, what are the big things you're seeing? >> One of the, I think the overarching management themes in general, it is the highest out by the highest performing teams are those where the individual with the context and the idea is able to execute as far and as fast and as efficiently as possible, and removing a lot of those encumbrances and put it a slightly different way. If DevOps was all basically boiled down to, how do we help more people write more software faster and safely data ops would be very similarly, how do we enable more people to do more things with data faster and safely? And to do that, I think the era of these massive multi-year efforts around data are gone and hopefully in the not too distant future, even these multi-quarter efforts around data are gone and we get into a much more agile, nimble methodology where smaller initiatives and smaller efforts are possible by more diverse skillsets across the business. And really what we should be doing is leveraging technology and automation to ensure that people are able to be productive and efficient and that we can trust our data and that systems are automated. And these are problems that technology is good at. And so in many ways, how in the early days Amazon would described as getting people out of the muck of DevOps. I think we're going to do the same thing around getting people out of the muck of the data and get them really focused on the higher level aspects. >> Yeah, we're going to get into that complexity, heavy lifting side muck, and then the heavy lifting taking away from the customers. But I want to go back to real quick with Jason while we're on this topic. Jason, I was just curious, how much has your team grown in the recent year and how much could've, should've grown, what's the status and how has Ascend helped you guys? What's the dynamic there? ' Cause that's their value proposition. So, take us through that. >> Absolutely, so, since the beginning of the year data engineering has doubled. So, we're a lean team, we certainly use the agile mindset and methodologies, but we have gone from, yeah, we've essentially doubled. So a lot of that is there's just so much to do and the capacity problem is certainly there. So we also spend a lot of time figuring out exactly what the right tooling is. And I was mentioning the technical debt. So you have those, there's the big O notation of whatever that involves technical debt. And when you're building new things, you're fixing old things. And then you're trying to maintain everything. That scaling starts to hit hard. So even if we continue to double, I mean, we could easily add more data engineers. And a lot of that is, I mean, you know about the hiring cycles, like, a lot of of great talent, but it's difficult to make all of those hires. So, we do spend quite a bit of time thinking about exactly what tools data engineering is using day-to-day. And what I mentioned, were technologies on the streaming side all the way to like the small batch things, but, like something that starts as a small batch getting grow and grow and grow and take, say 15 hours, it's possible, I've seen it. But, and getting that back down and managing that complexity while not overburdening people who probably don't want to spend all their waking hours building ETLs, maintaining ETL, putting in monitoring, putting in alerting, that I think is quite a challenge. >> It's so funny because you mentioned 18 hours, you got to kind of being, you didn't roll your eyes, but you almost did, but this is, but people want it yesterday, they want real time, so there's a lot of demand-- >> Yes. >> On the minds of the business outcome side of it. So, I got to ask you, because this comes up a lot with technical debt, and now we're starting to see that come into the data conversation. And so I always curious, is there a different kind of technical debt with data? Because again, data is like software, but it's a little bit of more elusive in the sense it's always changing. So is there, what kind of technical debt do you see in the data side that's different than say software side? >> Absolutely, now that's a great question. So a lot of thinking about your data and structuring your data and how you want to use that data going into a particular project might be different from what happens after stakeholders have a new considerations and new products and new items that need to be built. So thinking about how that, let's say you have a document store, or you have something that you thought was going to be nice and structured, how that can evolve and support those particular products can essentially, unless you take the time and go through and say, well, let's architect it perfectly so that we can handle that. You're going to make trade-offs and choices, and essentially that debt builds up. So you start cutting corners, you start changing your normalization. You start essentially taking those implicit schema that then tend to build into big things, big implicit schema. And then of course, with implicit schema, you're going to have a lot of null values, you're going to have a lot of items to deal with. So, how do you deal with that? And then you also have the opportunity to create keys and values and oops, do we take out those keys that were slightly misspelled? So, I could go on for hours, but basically the technical debt certainly is there with on data. I see a lot of this as just a spectrum of technical debt, because it's all trade-offs that you made to build a product, and the efficiency has start to hit you. So, the 15 hour ETL, I was mentioning, basically you start with something and you were building things for stakeholders and essentially you have so much complex logic within that. So for the transforms that you're doing from if you're thinking of the bronze, silver, gold, kind of a framework, going from that bronze to a silver, you may have a massive number of transformations or just a few, just to lightly dust it. But you could also go to gold with many more transformations and managing that, managing the complexity, managing what you're spending for servers day after day after day. That's another real challenge of that technical debt stuff. >> That's a great lead into my next question, for Sean, this is the disparate system complexity, technical debt and software was always kind of the belief was, oh yeah, I'll take some technical debt on and work it off once I get visibility and say, unit economics or some sort of platform or tool feature, and then you work it off as fast as possible. I was, this becomes the art and science of technical debt. Jason, what you're saying is that this can be unwieldy pretty quickly. You got state and you got a lot of different inter moving parts. This is a huge issue, Sean, this is where it's, technical debt in the data world is much different architecturally. If you don't get it right, this is a huge, huge issue. Could you aluminate why that is and what you guys are doing to help unify and change some of those conditions? >> Yeah, absolutely. When we think about technical debt and I'll keep drawing some parallels between DevOps and data ops, 'cause I think there's a tremendous number of similarities in these worlds. We used to always have the saying that "Your tech debt grows manually across microservices, "but exponentially within services." And so you want that right level of architecture and composibility if you will, of your systems where you can deploy changes, you can test, you can have high degrees of competence and the roll-outs. And I think the interesting part in the data side, as Jason highlighted, the big O-notation for tech debt in the data ecosystem, is still fairly exponential or polynomial in nature. As right now, we don't have great decomposition of the components. We have different systems. We have a streaming system, we have a databases, we have documents, doors and so on, but how the whole data pipeline data engineering part works generally tends to be pretty monolithic in nature. You take your whole data pipeline and you deploy the whole thing and you basically just cross your fingers, and hopefully it's not 15 hours, but if it is 15 hours, you go to sleep, you wake up the next morning, grab a coffee and then maybe it worked. And that iteration cycle is really slow. And so when we think about how we can improve these things, right? This is combinations of intelligent systems that do instantaneous schema detection, and validation, excuse me, it's combinations of things that do instantaneous schema detection and validation. It's things like automated lineage and dependency tracking. So you know that when you deploy code, what piece of data it affects, it's things like automated testing on individual core parts of your data pipelines to validate that you're getting the expected output that you need. So it's pulling a lot of these same DevOps style principles into the data world, which is really designed to going back to how do you help more people build more things faster and safely really rapid iterations for rapid feedback. So you know if there's breaks in the system much earlier on. >> Well, I think Sean, you're onto something really big there. And I think this is something that's emerging pretty quickly in the cloud scale that I called, 2.0, whatever, what version we're in, is the systems thinking mindset. 'Cause you mentioned the model that that was essentially a silo or subsystem. It was cohesive in it's own way, but now it's been monolithic. Now you have a broken down set of decomposed sets of data pieces that have to work together. So Jason, this is the big challenge that everyone, not really people are talking about, I think most these guys are, and you're using them. What are you unifying? Because this is a systems operating systems thinking, this is not like a database problem. It's a systems problem applied to data where databases are just pieces of it, what's your thoughts? >> That's absolutely right. And I would, so Sean touched on composibility of ETL and thinking about reusable components, thinking about pieces that all fit together, because as you're building something as complex as some of these ETS are, we do think about the platform itself and how that lends to the overarching output. So one thing, being able to actually see the different components of an ETL and blend those in and you as the dry principal, don't repeat yourself. So you essentially are able to take pieces that one person built, maybe John builds a couple of our connectors coming in, Sean also has a bunch of transforms and I just want this stuff out, so I can use a lot of what you guys have already built. I think that's key, because a lot of engineering and data engineering is about managing complexity. So taking that complexity and essentially getting it out fast and getting out error free, is where we're going with all of the data products we're building. >> What are some of the complexity that you guys have that you're dealing with? Can you be specific and share what these guys are doing to solve that problem for you? That's, this is a big problem everyone's having, I'm seeing that all over the place. >> Absolutely, so I could start at a couple of places. So I don't know if you guys are on the three Vs, four Vs or five Vs, but we have all of those. And if you go to that five, four or five V model, there is the veracity piece, which you have to ask yourself, is it true? Is it accurate when? So change happens throughout the pipeline, change can come from web hooks, change can come from users. You have to make sure that you're managing that complexity and what we we're building, I mentioned that we are paying down a lot of tech debt, but we're also building new products. And one pretty challenging, quite challenging ETL that we're building is something going from a document store to an analytical application. So in that document store, we talked about flexible schema. Basically, you don't really know exactly what you're going to get day to day, and you need to be able to manage that change through the whole process in a way that the ultimate business users find value. So, that's one of the key applications that we're using right now. And that's one that the team at Ascend and my team are working hand in hand going through a lot of those challenges. And it's, I also watch the slack just as Sean does, and it's a very active discussion board. So it is essentially like they're just partnering together. It's fabulous, but yeah-- >> And you're seeing kind of a value on this too, I mean, in terms of output what's the business results? >> Yes, absolutely. So essentially this is all, so yes, the fifth V value. So, getting to that value is essentially, there were a few pieces of the, to the value. So there's some data products that we're building within that product and their data science, data analytics based products that essentially do things with the data that help the user. There's also the question of exactly the usage and those kinds of metrics that people in ops want to understand as well as our growth team. So we have internal and external stakeholders for that. >> Jason, this is a great use case, a great customer, Sean, you guys are automating. For the folks watching, who were seeing their peer living the dream here and the data journey, as we say, things are happening. What's the message to customers that you guys want to send because you guys are really cutting your teeth into a whole another level of data engineering, data platform. That's really about the systems view and about cloud. What's the pitch, Sean? What should people know about the company? >> Absolutely, yeah, well, so one, I'd say even before the pitch, I would encourage people to not accept the status quo. And in particular, in data engineering today, the status quo is an incredibly high degree of pain and discomfort. And I think the important part of why Ascend exists and why we're so helpful for our customers, there is a much more automated future of how we build data products, how we optimize those and how we can get a larger cohort of builders into the data ecosystem. And that helps us get out of the muck as we talked about before and put really advanced technology to work for more people inside of our companies to build these data products, leveraging the latest and greatest technologies to drive increased business value faster. >> Jason, what's your assessment of these guys, as people are watching might say, hey, you know what, I'm going to contact them, I need this. How would you talk about Ascend into your peers? >> Absolutely, so I think just thinking about the whole process has been a great partnership. We started with a POC, I think Ascend likes to start with three use cases, I think we came out with four and we went through the ones that we really cared about and really wanted to bring value to the company with. So we have roadmaps for some, as we're paying down technical debt and transitioning, others we can go directly to. And I think that thinking about just like you're saying, John, that systems view of everything you're building, where that makes sense, you can actually take a lot of that complexity and encapsulate it in a way that you can essentially manage it all in that platform. So the Ascend platform has the composibility piece that we touched on. It also, not only can you compose it, but you can drill into it. And my team is super talented and is going to drill into it. So basically loves to open up each of those data flows each of the components therein and has the control there with the combination of Spark Sequel, PI Spark SQL Scala and so on. And I think that the variety of connections is also quite helpful. So thinking about the dry principle from a systems perspective is extremely useful because it's dry, you often get that in a code review, right? I think you can be a little bit more dry here. >> Yeah. >> But you can really do that in the way that you're composing your systems as well. >> That's a great, great point. One quick thing for the folks that they're watching that are trying to figure this out, and a lot of architecture is going on. A lot of people are looking at different solutions. What things have you learned that you could give them a tip like to avoid like maybe some scar tissue or tips of the trade, where you can say, hey, this way, be careful, what's some of the learnings? Could you give a few pointers to folks out there, if they're kicking tires on the direction, what's the wrong direction? What's the right direction look like? >> Absolutely, I think that, I think it through, and I don't know how much time we have that, that feels like a few days conversation as far as ways to go wrong. But absolutely, I think that thinking through exactly where want to be is the key. Otherwise it's kind of like when you're writing a ticket on Jarrah, if you don't have clear success criteria, if you don't know where you going to go, then you'll end up somewhere building something and it might work. But if you think through your exact destination that you want to be at, that will drive a lot of the decisions as you think backwards to where you started. And also I think that, so Sean also mentioned challenging the status quo. I think that you really have to be ready to challenge the status quo at every step of that journey. So if you start with some particular service that you had and its legacy, if it's not essentially performing what you need, then it's okay to just take a step back and say, well, maybe that's not the one. So I think that thinking through the system, just like you were saying, John, and also I think that having a visual representation of where you want to go is critical. So hopefully that encapsulates a lot of it, but yes, the destination is key. >> Yeah, and having an engineering platform that also unifies the multiple components and it's agile. >> That's right. >> It gets you out of the muck and on the last day and differentiate heavy lifting is a cloud plan. >> Absolutely. >> Sean, wrap it up for us here. What's the bumper sticker for your vision, share your founding principles of the company. >> Absolutely, for us, we started the company as a former in recovery and CTO. The last company I founded, we had nearly 60 people on our data team alone and had invested tremendous amounts of effort over the course of eight years. And one of the things that I've learned is that over time innovation comes just as much from deciding what you're no longer going to do as what you're going to do. And focusing heavily around, how do you get out of that muck? How do you continue to climb up that technology stack? Is incredibly important. And so really we are excited to be a part of it and taking the industry is continuing to climb higher and higher level. We're building more and more advanced levels of automation and what we call our data awareness into the automated engine of the Ascend platform that takes us across the entire data ecosystem, connecting and automating all data movement. And so we have a very exciting vision for this fabric that's emerging over time. >> Awesome, Sean, thank you so much for that insight, Jason, thanks for coming on customer of Ascend.io. >> Thank you. >> I appreciate it, gentlemen, thank you. This is the track on automating analytic workloads. We here at the end of us showcase, startup showcase, the hottest companies here at Ascend.io, I'm John Furrier, with theCUBE, thanks for watching. (upbeat music)

Published Date : Sep 22 2021

SUMMARY :

and Jason, nice to meet you. So, and Steady as a customer. and really helping to ensure great to have you on as a person kind of intellectual property the database is you can, So all of that is built of the critical problems that the business and cultural makeup of companies. and data really is that field, that oil but what if you don't have it, right? that it's really part of the package. What do you guys see as and the idea is able to execute as far grown in the recent year And a lot of that is, I mean, that come into the data conversation. and essentially you have so and then you work it and you basically just cross your fingers, And I think this is something and how that lends to complexity that you guys have and you need to be able of exactly the usage that you guys want to send of builders into the data ecosystem. hey, you know what, I'm going and has the control there in the way that you're that you could give them a tip of where you want to go is critical. Yeah, and having an and on the last day and What's the bumper sticker for your vision, and taking the industry is continuing Awesome, Sean, thank you This is the track on

ENTITIES

Entity	Category	Confidence
Andy	PERSON	0.99+
Jason	PERSON	0.99+
Sean	PERSON	0.99+
Emily Freeman	PERSON	0.99+
Sean Knapp	PERSON	0.99+
Jason Robinson	PERSON	0.99+
Amazon	ORGANIZATION	0.99+
John	PERSON	0.99+
Andy Jassy	PERSON	0.99+
AWS	ORGANIZATION	0.99+
John Furrier	PERSON	0.99+
15 hours	QUANTITY	0.99+
Ascend	ORGANIZATION	0.99+
last year	DATE	0.99+
96%	QUANTITY	0.99+
eight years	QUANTITY	0.99+
15 hour	QUANTITY	0.99+
iOS Store	TITLE	0.99+
18 hours	QUANTITY	0.99+
Google Play Store	TITLE	0.99+
Ascend.io	ORGANIZATION	0.99+
Steady	ORGANIZATION	0.99+
yesterday	DATE	0.99+
six months	QUANTITY	0.99+
five	QUANTITY	0.99+
third	QUANTITY	0.99+
Spark Sequel	TITLE	0.99+
two	DATE	0.98+
Today	DATE	0.98+
a year	QUANTITY	0.98+
two years	QUANTITY	0.98+
two years ago	DATE	0.98+
today	DATE	0.98+
four	QUANTITY	0.98+
Jarrah	PERSON	0.98+
each	QUANTITY	0.97+
theCUBE	ORGANIZATION	0.97+
three years ago	DATE	0.97+
one	QUANTITY	0.97+
3 million plus	QUANTITY	0.97+
4%	QUANTITY	0.97+
one thing	QUANTITY	0.96+
one team	QUANTITY	0.95+
three use cases	QUANTITY	0.94+
one person	QUANTITY	0.93+
nearly 60 people	QUANTITY	0.93+
one persona	QUANTITY	0.91+

Sam Lightstone, IBM | Machine Learning Everywhere 2018

>> Narrator: Live from New York, it's the Cube. Covering Machine Learning Everywhere: Build Your Ladder to AI. Brought to you by IBM. >> And welcome back here to New York City. We're at IBM's Machine Learning Everywhere: Build Your Ladder to AI, along with Dave Vellante, John Walls, and we're now joined by Sam Lightstone, who is an IBM fellow in analytics. And Sam, good morning. Thanks for joining us here once again on the Cube. >> Yeah, thanks a lot. Great to be back. >> Yeah, great. Yeah, good to have you here on kind of a moldy New York day here in late February. So we're talking, obviously data is the new norm, is what certainly, have heard a lot about here today and of late here from IBM. Talk to me about, in your terms, of just when you look at data and evolution and to where it's now become so central to what every enterprise is doing and must do. I mean, how do you do it? Give me a 30,000-foot level right now from your prism. >> Sure, I mean, from a super, if you just stand back, like way far back, and look at what data means to us today, it's really the thing that is separating companies one from the other. How much data do they have and can they make excellent use of it to achieve competitive advantage? And so many companies today are about data and only data. I mean, I'll give you some like really striking, disruptive examples of companies that are tremendously successful household names and it's all about the data. So the world's largest transportation company, or personal taxi, can't call it taxi, but (laughs) but, you know, Uber-- >> Yeah, right. >> Owns no cars, right? The world's largest accommodation company, Airbnb, owns no hotels, right? The world's largest distributor of motion pictures owns no movie theaters. So these companies are disrupting because they're focused on data, not on the material stuff. Material stuff is important, obviously. Somebody needs to own a car, somebody needs to own a way to view a motion picture, and so on. But data is what differentiates companies more than anything else today. And can they tap into the data, can they make sense of it for competitive advantage? And that's not only true for companies that are, you know, cloud companies. That's true for every company, whether you're a bricks and mortars organization or not. Now, one level of that data is to simply look at the data and ask questions of the data, the kinds of data that you already have in your mind. Generating reports, understanding who your customers are, and so on. That's sort of a fundamental level. But the deeper level, the exciting transformation that's going on right now, is the transformation from reporting and what we'll call business intelligence, the ability to take those reports and that insight on data and to visualize it in the way that human beings can understand it, and go much deeper into machine learning and AI, cognitive computing where we can start to learn from this data and learn at the pace of machines, and to drill into the data in a way that a human being cannot because we can't look at bajillions of bytes of data on our own, but machines can do that and they're very good at doing that. So it is a huge, that's one level. The other level is, there's so much more data now than there ever was because there's so many more devices that are now collecting data. And all of us, you know, every one of our phones is collecting data right now. Your cars are collecting data. I think there's something like 60 sensors on every car that rolls of the manufacturing line today. 60. So it's just a wild time and a very exciting time because there's so much untapped potential. And that's what we're here about today, you know. Machine learning, tapping into that unbelievable potential that's there in that data. >> So you're absolutely right on. I mean the data is foundational, or must be foundational in order to succeed in this sort of data-driven world. But it's not necessarily the center of the universe for a lot of companies. I mean, it is for the big data, you know, guys that we all know. You know, the top market cap companies. But so many organizations, they're sort of, human expertise is at the center of their universe, and data is sort of, oh yeah, bolt on, and like you say, reporting. >> Right. >> So how do they deal with that? Do they get one big giant DB2 instance and stuff all the data in there, and infuse it with MI? Is that even practical? How do they solve this problem? >> Yeah, that's a great question. And there's, again, there's a multi-layered answer to that. But let me start with the most, you know, one of the big changes, one of the massive shifts that's been going on over the last decade is the shift to cloud. And people think of the shift to cloud as, well, I don't have to own the server. Someone else will own the server. That's actually not the right way to look at it. I mean, that is one element of cloud computing, but it's not, for me, the most transformative. The big thing about the cloud is the introduction of fully-managed services. It's not just you don't own the server. You don't have to install, configure, or tune anything. Now that's directly related to the topic that you just raised, because people have expertise, domains of expertise in their business. Maybe you're a manufacturer and you have expertise in manufacturing. If you're a bank, you have expertise in banking. You may not be a high-tech expert. You may not have deep skills in tech. So one of the great elements of the cloud is that now you can use these fully managed services and you don't have to be a database expert anymore. You don't have to be an expert in tuning SQL or JSON, or yadda yadda. Someone else takes care of that for you, and that's the elegance of a fully managed service, not just that someone else has got the hardware, but they're taking care of all the complexity. And that's huge. The other thing that I would say is, you know, the companies that are really like the big data houses, they got lots of data, they've spent the last 20 years working so hard to converge their data into larger and larger data lakes. And some have been more successful than others. But everybody has found that that's quite hard to do. Data is coming in many places, in many different repositories, and trying to consolidate, you know, rip the data out, constantly ripping it out and replicating into some data lake where you, or data warehouse where you can do your analytics, is complicated. And it means in some ways you're multiplying your costs because you have the data in its original location and now you're copying it into yet another location. You've got to pay for that, too. So you're multiplying costs. So one of the things I'm very excited about at IBM is we've been working on this new technology that we've now branded it as IBM Queryplex. And that gives us the ability to query data across all of these myriad sources as if they are in one place. As if they are a single consolidated data lake, and make it all look like (snaps) one repository. And not only to the application appear as one repository, but actually tap into the processing power of every one of those data sources. So if you have 1,000 of them, we'll bring to bear the power 1,000 data sources and all that computing and all that memory on these analytics problems. >> Well, give me an example why that matters, of what would be a real-world application of that. >> Oh, sure, so there, you know, there's a couple of examples. I'll give you two extremes, two different extremes. One extreme would be what I'll call enterprise, enterprise data consolidation or virtualization, where you're a large institution and you have several of these repositories. Maybe you got some IBM repositories like DB2. Maybe you've got a little bit of Oracle and a little bit of SQL Server. Maybe you've got some open source stuff like Postgres or MySQL. You got a bunch of these and different departments use different things, and it develops over decades and to some extent you can't even control it, (laughs) right? And now you just want to get analytics on that. You just, what's this data telling me? And as long as all that data is sitting in these, you know, dozens or hundreds of different repositories, you can't tell, unless you copy it all out into a big data lake, which is expensive and complicated. So Queryplex will solve that problem. >> So it's sort of a virtual data store. >> Yeah, and one of the terms, many different terms that are used, but one of the terms that's used in the industry is data virtualization. So that would be a suitable terminology here as well. To make all that data in hundreds, thousands, even millions of possible data sources, appear as one thing, it has to tap into the processing power of all of them at once. Now, that's one extreme. Let's take another extreme, which is even more extreme, which is the IoT scenario, Internet of Things, right? Internet of Things. Imagine you've, have devices, you know, shipping containers and smart meters on buildings. You could literally have 100,000 of these or a million of these things. They're usually small; they don't usually have a lot of data on them. But they can store, usually, couple of months of data. And what's fascinating about that is that most analytics today are really on the most recent you know, 48 hours or four weeks, maybe. And that time is getting shorter and shorter, because people are doing analytics more regularly and they're interested in, just tell me what's going on recently. >> I got to geek out here, for a second. >> Please, well thanks for the warning. (laughs) >> And I know you know things, but I'm not a, I'm not a technical person, but I've been a molt. I've been around a long time. A lot of questions on data virtualization, but let me start with Queryplex. The name is really interesting to me. When I, and you're a database expert, so I'm going to tap your expertise. When I read the Google Spanner paper, I called up my colleague David Floyer, who's an ex-IBM, I said, "This is like global Sysplex. "It's a global distributed thing," And he goes, "Yeah, kind of." And I got very excited. And then my eyes started bleeding when I read the paper, but the name, Queryplex, is it a play on Sysplex? Is there-- >> It's actually, there's a long story. I don't think I can say the story on-air, but we, suffice it to say we wanted to get a name that was legally usable and also descriptive. >> Dave: Okay. >> And we went through literally hundreds and hundreds of permutations of words and we finally landed on Queryplex. But, you know, you mentioned Google Spanner. I probably should spend a moment to differentiate how what we're doing is-- >> Great, if you would. >> A different kind of thing. You know, on Google Spanner, you put data into Google Spanner. With Queryplex, you don't put data into it. >> Dave: Don't have to move it. >> You don't have to move it. You leave it where it is. You can have your data in DB2, you can have it in Oracle, you can have it in a flat file, you can have an Excel spreadsheet, and you know, think about that. An Excel spreadsheet, a collection of text files, comma delimited text files, SQL Server, Oracle, DB2, Netezza, all these things suddenly appear as one database. So that's the transformation. It's not about we'll take your data and copy it into our system, this is about leave your data where it is, and we're going to tap into your (snaps) existing systems for you and help you see them in a unified way. So it's a very different paradigm than what others have done. Part of the reason why we're so excited about it is we're, as far as we know, nobody else is really doing anything quite like this. >> And is that what gets people to the 21st century, basically, is that they have all these legacy systems and yet the conversion is much simpler, much more economical for them? >> Yeah, exactly. It's economical, it's fast. (snaps) You can deploy this in, you know, a very small amount of time. And we're here today talking about machine learning and it's a very good segue to point out in order to get to high-quality AI, you need to have a really strong foundation of an information architecture. And for the industry to show up, as some have done over the past decade, and keep telling people to re-architect their data infrastructure, keep modifying their databases and creating new databases and data lakes and warehouses, you know, it's just not realistic. And so we want to provide a different path. A path that says we're going to make it possible for you to have superb machine learning, cognitive computing, artificial intelligence, and you don't have to rebuild your information architecture. We're going to make it possible for you to leverage what you have and do something special. >> This is exciting. I wasn't aware of this capability. And we were talking earlier about the cloud and the managed service component of that as a major driver of lowering cost and complexity. There's another factor here, which is, we talked about moving data-- >> Right. >> And that's one of the most expensive components of any infrastructure. If I got to move data and the transmission costs and the latency, it's virtually impossible. Speed of light's still up. I know you guys are working on speed of light, but (Sam laughs) you'll eventually get there. >> Right. >> Maybe. But the other thing about cloud economics, and this relates to sort of Queryplex. There's this API economy. You've got virtually zero marginal costs. When you were talking, I was writing these down. You got global scale, it's never down, you've got this network effect working for you. Are you able to, are the standards there? Are you able to replicate those sort of cloud economics the APIs, the standards, that scale, even though you're not in control of this, there's not a single point of control? Can you explain sort of how that magic works? >> Yeah, well I think the API economy is for real and it's very important for us. And it's very important that, you know, we talk about API standards. There's a beautiful quote I once heard. The beautiful thing about standards is there's so many to choose from. (All laugh) And the reality is that, you know, you have standards that are official standards, and then you have the de facto standards because something just catches on and nobody blessed it. It just got popular. So that's a big part of what we're doing at IBM is being at the forefront of adopting the standards that matter. We made a big, a big investment in being Spark compatible, and, in fact, even with Queryplex. You can issue Spark SQL against Queryplex even though it's not a Spark engine, per se, but we make it look and feel like it can be Spark SQL. Another critical point here, when we talk about the API economy, and the speed of light, and movement to the cloud, and these topics you just raised, the friction of the Internet is an unbelievable friction. (John laughs) It's unbelievable. I mean, you know, when you go and watch a movie over the Internet, your home connection is just barely keeping up. I mean, you're pushing it, man. So a gigabyte, you know, a gigabyte an hour or something like that, right? Okay, and if you're a big company, maybe you have a fatter pipe. But not a lot fatter. I mean, not orders of, you're talking incredible friction. And what that means is that it is difficult for people, for companies, to en masse, move everything to the cloud. It's just not happening overnight. And, again, in the interest of doing the best possible service to our customers, that's why we've made it a fundamental element of our strategy in IBM to be a hybrid, what we call hybrid data management company, so that the APIs that we use on the cloud, they are compatible with the APIs that we use on premises. And whether that's software or private cloud. You've got software, you've got private cloud, you've got public cloud. And our APIs are going to be consistent across, and applications that you code for one will run on the other. And you can, that makes it a lot easier to migrate at your leisure when you're ready. >> Makes a lot of sense. That way you can bring cloud economics and the cloud operating model to your data, wherever the data exists. Listening to you speak, Sam, it reminds me, do you remember when Bob Metcalfe who I used to work with at IDG, predicted the collapse of the Internet? He predicted that year after year after year, in speech after speech, that it was so fragile, and you're bringing back that point of, guys, it's still, you know, a lot of friction. So that's very interesting, (laughs) as an architect. >> You think Bob's going to be happy that you brought up that he predicted the Internet was going to be its own demise? (Sam laughs) >> Well, he did it in-- >> I'm just saying. >> I'm staying out of it, man. >> He did it as a lightning rod. >> As a talking-- >> To get the industry to respond, and he had a big enough voice so he could do that. >> That it worked, right. But so I want to get back to Queryplex and the secret sauce. Somehow you're creating this data virtualization capability. What's the secret sauce behind it? >> Yeah, so I think, we're not the first to try, by the way. Actually this problem-- >> Hard problem. >> Of all these data sources all over the place, you try to make them look like one thing. People have been trying to figure out how to do that since like the '70s, okay, so, but-- >> Dave: Really hasn't worked. >> And it hasn't worked. And really, the reason why it hasn't worked is that there's been two fundamental strategies. One strategy is, you have a central coordinator that tries to speak to each of these data sources. So I've got, let's say, 10,000 data sources. I want to have one coordinator tap into each of them and have a dialogue. And what happens is that that coordinator, a server, an agent somewhere, becomes a network bottleneck. You were talking about the friction of the Internet. This is a great example of friction. One coordinator trying to speak to, you know, and collaborators becomes a point of friction. And it also becomes a point of friction not only in the Internet, but also in the computation, because he ends up doing too much of the work. There's too many things that cannot be done at the, at these edge repositories, aggregations, and joins, and so on. So all the aggregations and joins get done by this one sucker who can't keep up. >> Dave: The queue. >> Yeah, so there's a big queue, right. So that's one strategy that didn't work. The other strategy that people tried was sort of an end squared topology where every data source tries to speak to every other data source. And that doesn't scale as well. So what we've done in Queryplex is something that we think is unique and much more organic where we try to organize the universe or constellation of these data sources so that every data source speaks to a small number of peers but not a large number of peers. And that way no single source is a bottleneck, either in network or in computation. That's one trick. And the second trick is we've designed algorithms that can truly be distributed. So you can do joins in a distributed manner. You can do aggregation in a distributed manner. These are things, you know, when I say aggregation, I'm talking about simple things like a sum or an average or a median. These are super popular in, in analytic queries. Everybody wants to do a sum or an average or a median, right? But in the past, those things were hard to do in a distributed manner, getting all the participants in this universe to do some small incremental piece of the computation. So it's really these two things. Number one, this organic, dynamically forming constellation of devices. Dynamically forming a way that is latency aware. So if I'm a, if I represent a data source that's joining this universe or constellation, I'm going to try to find peers who I have a fast connection with. If all the universe of peers were out there, I'll try to find ones that are fast. And the second is having algorithms that we can all collaborate on. Those two things change the game. >> We're getting the two minute sign, and this is fascinating stuff. But so, how do you deal with the data consistency problem? You hear about eventual consistency and people using atomic clocks and-- Right, so Queryplex, you know, there's a reason we call it Queryplex not Dataplex. Queryplex is really a read-only operation. >> Dave: Oh, there you go. >> You've got all these-- >> Problem solved. (laughs) >> Problem solved. You've got all these data sources. They're already doing their, they already have data's coming in how it's coming in. >> Dave: Simple and brilliant. >> Right, and we're not changing any of that. All we're saying is, if you want to query them as one, you can query them as one. I should say a few words about the machine learning that we're doing here at the conference. We've talked about the importance of an information architecture and how that lays a foundation for machine learning. But one of the things that we're showing and demonstrating at the conference today, or at the showcase today, is how we're actually putting machine learning into the database. Create databases that learn and improve over time, learn from experience. In 1952, Arthur Samuel was a researcher at IBM who first, had one of the most fundamental breakthroughs in machine learning when he created a machine learning algorithm that will play checkers. And he programmed this checker playing game of his so it would learn over time. And then he had a great idea. He programmed it so it would play itself, thousands and thousands and thousands of times over, so it would actually learn from its own mistakes. And, you know, the evolution since then. Deep Blue playing chess and so on. The Watson Jeopardy game. We've seen tremendous potential in machine learning. We're putting into the database so databases can be smarter, faster, more consistent, and really just out of the box (snaps) performing. >> I'm glad you brought that up. I was going to ask you, because the legend Steve Mills once said to me, I had asked him a question about in-memory databases. He said ever databases have been around, in-memory databases have been around. But ML-infused databases are new. >> Sam: That's right, something totally new. >> Dave: Yeah, great. >> Well, you mentioned Deep Blue. Looking forward to having Garry Kasparov on a little bit later on here. And I know he's speaking as well. But fascinating stuff that you've covered here, Sam. We appreciate the time here. >> Thank you, thanks for having me. >> And wish you continued success, as well. >> Thank you very much. >> Sam Lightstone, IBM fellow joining us here live on the Cube. We're back with more here from New York City right after this. (electronic music)

Published Date : Feb 27 2018

SUMMARY :

Brought to you by IBM. and we're now joined by Sam Lightstone, Great to be back. Yeah, good to have you here on kind of a moldy New York day and it's all about the data. the kinds of data that you already have in your mind. I mean, it is for the big data, you know, and trying to consolidate, you know, rip the data out, of what would be a real-world application of that. and you have several of these repositories. Yeah, and one of the terms, Please, well thanks for the warning. And I know you know things, but I'm not a, suffice it to say we wanted to get a name that was But, you know, you mentioned Google Spanner. With Queryplex, you don't put data into it. and you know, think about that. And for the industry to show up, and the managed service component of that And that's one of the most expensive components and this relates to sort of Queryplex. And the reality is that, you know, and the cloud operating model to your data, To get the industry What's the secret sauce behind it? Yeah, so I think, we're not the first to try, by the way. you try to make them look like one thing. And really, the reason why it hasn't worked is that And the second trick is Right, so Queryplex, you know, Problem solved. You've got all these data sources. and really just out of the box (snaps) performing. because the legend Steve Mills once said to me, Well, you mentioned Deep Blue. live on the Cube.

ENTITIES

Entity	Category	Confidence
David	PERSON	0.99+
Amazon	ORGANIZATION	0.99+
Dave Vellante	PERSON	0.99+
Justin Warren	PERSON	0.99+
Sanjay Poonen	PERSON	0.99+
IBM	ORGANIZATION	0.99+
Clarke	PERSON	0.99+
David Floyer	PERSON	0.99+
Jeff Frick	PERSON	0.99+
Dave Volante	PERSON	0.99+
George	PERSON	0.99+
Dave	PERSON	0.99+
Diane Greene	PERSON	0.99+
Michele Paluso	PERSON	0.99+
AWS	ORGANIZATION	0.99+
Sam Lightstone	PERSON	0.99+
Dan Hushon	PERSON	0.99+
Nutanix	ORGANIZATION	0.99+
Teresa Carlson	PERSON	0.99+
Kevin	PERSON	0.99+
Andy Armstrong	PERSON	0.99+
Michael Dell	PERSON	0.99+
Pat Gelsinger	PERSON	0.99+
John	PERSON	0.99+
Google	ORGANIZATION	0.99+
Lisa Martin	PERSON	0.99+
Kevin Sheehan	PERSON	0.99+
Leandro Nunez	PERSON	0.99+
Microsoft	ORGANIZATION	0.99+
Oracle	ORGANIZATION	0.99+
Alibaba	ORGANIZATION	0.99+
NVIDIA	ORGANIZATION	0.99+
EMC	ORGANIZATION	0.99+
GE	ORGANIZATION	0.99+
NetApp	ORGANIZATION	0.99+
Keith	PERSON	0.99+
Bob Metcalfe	PERSON	0.99+
VMware	ORGANIZATION	0.99+
90%	QUANTITY	0.99+
Sam	PERSON	0.99+
Larry Biagini	PERSON	0.99+
Rebecca Knight	PERSON	0.99+
Brendan	PERSON	0.99+
Dell	ORGANIZATION	0.99+
Peter	PERSON	0.99+
Clarke Patterson	PERSON	0.99+

Josh Klahr & Prashanthi Paty | DataWorks Summit 2017

>> Announcer: Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017. Brought to you by Hortonworks. >> Hey, welcome back to theCUBE. Day two of the DataWorks Summit, I'm Lisa Martin with my cohost, George Gilbert. We've had a great day and a half so far, learning a ton in this hyper-growth, big data world meets IoT, machine learning, data science. George and I are excited to welcome our next guests. We have Josh Klahr, the VP of Product Management from AtScale. Welcome George, welcome back. >> Thank you. >> And we have Prashanthi Paty, the Head of Data Engineering for GoDaddy. Welcome to theCUBE. >> Thank you. >> Great to have you guys here. So, wanted to kind of talk to you guys about, one, how you guys are working together, but two, also some of the trends that you guys are seeing. So as we talked about, in the tech industry, it's two degrees of Kevin Bacon, right. You guys worked together back in the day at Yahoo. Talk to us about what you both visualized and experienced in terms of the Hadoop adoption maturity cycle. >> Sure. >> You want to start, Josh? >> Yeah, I'll start, and you can chime in and correct me. But yeah, as you mentioned, Prashanthi and I worked together at Yahoo. It feels like a long time ago. In our central data group. And we had two main jobs. First job was, collect all of the data from our ad systems, our audience systems, and stick that data into a Hadoop cluster. At the time, we were kind of doing it while Hadoop was kind of being developed. And the other thing that we did was, we had to support a bunch of BI consumers. So we built cubes, we built data marts, we used MicroStrategy, Tableau, and I would say the experience there was a great experience with Hadoop in terms of the ability to have low-cost storage, scale out data processing of all of, what were really, billions and billions, tens of billions of events a day. But when it came to BI, it felt like we were doing stuff the old way. And we were moving data off cluster, and making it small. In fact, you did a lot of that. >> Well, yeah, at the end of the day, we were using Hadoop as a staging layer. So we would process a whole bunch of data there, and then we would scale it back, and move it into, again, relational stores or cubes, because basically we couldn't afford to give any accessibility to BI tools or to our end users directly on Hadoop. So while we surely did a large-scale data processing in Hadoop layer, we failed to turn on the insights right there. >> Lisa: Okay. >> Maybe there's a lesson in there for folks who are getting slightly more mature versions of Hadoop now, but can learn from also some of the experiences you've had. Were there issues in terms of, having cleaned and curated data, were there issues for BI with performance and the lack of proper file formats like Parquet? What was it that where you hit the wall? >> It was both, you have to remember this, we were probably one of the first teams to put a data warehouse on Hadoop. So we were dealing with Pig versions of like, 0.5, 0.6, so we were putting a lot of demand on the tooling and the infrastructure. Hadoop was still in a very nascent stage at that time. That was one. And I think a lot of the focus was on, hey, now we have the ability to do clickstream analytics at scale, right. So we did a lot of the backend stuff. But the presentation is where I think we struggled. >> So would that mean that you did do, the idea is that you could do full resolution without sampling on the backend, and then you would extract and presumably sort of denormalize so that you could, essentially run data match for subject matter interests. >> Yeah, and that's exactly what we did is, we took all of this big data, but to make it work for BI, which were two things, one was performance. It was really, can you get an interactive query and response time. And the other thing was the interface. Can a Tableau user connect and understand what they're looking at. You had to make the data small again. And that was actually the genesis of AtScale, which is where I am today, was, we were frustrated with this, big data platform and having to then make the data small again in order to support BI. >> That's a great transition, Josh. Let's actually talk about AtScale. You guys saw BI on Hadoop as this big white space. How have you succeeded there, and then let's talk about what GoDaddy is doing with AtScale and big data. >> Yeah, I think that we definitely learned, we took the learnings from our experience at Yahoo, and we really thought about, if we were to start from scratch, and solve the problem the way we wanted it to be solved, what would that system look like. And it was a few things. One was an interface that worked for BI. I don't want to date myself, but my experience in the software space started with OLAP. And I can tell you OLAP isn't dead. When you go and talk to an enterprise, a fortune 1000 enterprise and you talk about OLAP, that's how they think. They think in terms of measures and dimensions and hierarchies. So one important thing for us was to project an OLAP interface on top of data that's Hadoop native. It's Hive tables, Parquet, ORC, you kind of talk about all of the mess that may sit underneath the covers. So one thing was projecting that interface, the other thing was delivering performance. So we've invested a lot in using the Hadoop cluster natively to deliver performing queries. We do this by creating aggregate tables and summary tables and being smart about how we route queries. But we've done it in a way that makes a Hadoop admin very happy. You don't have to buy a bunch of AtScale servers in addition to your Hadoop cluster. We scale the way the Hadoop cluster scales. So we don't require separate technology. So we fit really nicely into that Hadoop ecosystem. >> So how do you make, making the Hadoop admin happy is a good thing. How do you make the business user happy, who needs now, as we were here yesterday, to kind of merge more with the data science folks to be able to understand or even have the chance to articulate, "These are the business outcomes "we want to look for and we want to see." How do you guys, maybe, under the hood, if you will, AtScale, make the business guys and gals happy? >> I'll share my opinion and then Prashanthi can comment on her experience but, as I've mentioned before, the business users want an interface that's simple to use. And so that's one thing we do, is, we give them the ability to just look at measures and dimensions. If I'm a business, I grew up using Excel to do my analysis. The thing I like most as an analyst is a big fat wide table. And so that's what, we make an underlying Hadoop cluster and what could be tens or hundreds of tables look like a single big fat wide table for a data analyst. You talk to a data scientist, you talk to a business analyst, that's the way they want to view the world. So that's one thing we do. And then, we give them response times that are fast. We give them interactivity, so that you could really quickly start to get a sense of the shape of the data. >> And allowing them to get that time to value. >> Yes. >> I can imagine. >> Just a follow-up on that. When you have to prepare the aggregates, essentially like the cubes, instead of the old BI tools running on a data mart, what is the additional latency that's required from data coming fresh into the data lake and then transforming it into something that's consumption ready for the business user? >> Yeah, I think I can take that. So again, if you look at the last 10 years, in the initial period, certainly at Yahoo, we just threw engineering resources at that problem, right. So we had teams dedicated to building these aggregates. But the whole premise of Hadoop was the ability to do unstructured optimizations. And by having a team find out the new data coming in and then integrating that into your pipeline, so we were adding a lot of latency. And so we needed to figure out how we can do this in a more seamless way, in a more real-time way. And get the, you know, the real premise of Hadoop. Get it at the hands of our business users. I mean, I think that's where AtScale is doing a lot of the good work in terms of dynamically being able to create aggregates based on the design that you put in the cube. So we are starting to work with them on our implementation. We're looking forward to the results. >> Tell us a little bit more about what you're looking to achieve. So GoDaddy is a customer of AtScale. Tell us a little bit more about that. What are you looking to build together, and kind of, where are you in your journey right now? >> Yeah, so the main goal for us is to move beyond predefined models, dashboards, and reports. So we want to be more agile with our schema changes. Time to market is one. And performance, right. Ability to put BI tools directly on top of Hadoop, is one. And also to push as much of the semantics as possible down into the Hadoop layer. So those are the things that we're looking to do. >> So that sounds like a classic business intelligence component, but sort of rethought for a big data era. >> I love that quote, and I feel it. >> Prashanthi: Yes. >> Josh: Yes. (laughing) >> That's exactly what we're trying to do. >> But that's also, some of the things you mentioned are non-trivial. You want to have this, time goes in to the pre-processing of data so that it's consumable, but you also wanted it to be dynamic, which is sort of a trade-off, which means, you know, that takes time. So is that a sort of a set of requirements, a wishlist for AtScale, or is that something that you're building on your own? >> I think there's a lot happening in that space. They are one of the first people to come out with their product, which is solving a real problem that we tried to solve for a long time. And I think as we start using them more and more, we'll surely be pushing them to bring in more features. I think the algorithm that they have to dynamically generate aggregates is something that we're giving quite a lot of feedback to them on. >> Our last guest from Pentaho was talking about, there was, in her keynote today, the quote from I think McKinsey report that said, "40% of machine learning data is either not fully "exploited or not used at all." So, tell us, kind of, where is big daddy regarding machine learning? What are you seeing? What are you seeing at AtScale and how are you guys going to work together to maybe venture into that frontier? >> Yeah, I mean, I think one of the key requirements we're placing on our data scientists is, not only do you have to be very good at your data science job, you have to be a very good programmer too to make use of the big data technologies. And we're seeing some interesting developments like very workload-specific engines coming into the market now for search, for graph, for machine learning, as well. Which is supposed to give the tools right into the hands of data scientists. I personally haven't worked with them to be able to comment. But I do think that the next realm on big data is this workload-specific engines, and coming on top of Hadoop, and realizing more of the insights for the end users. >> Curious, can you elaborate a little more on those workload-specific engines, that sounds rather intriguing. >> Well, I think interactive, interacting with Hadoop on a real-time basis, we see search-based engines like Elasticsearch, Solr, and there is also Druid. At Yahoo, we were quite a bit shop of Druid actually. And we were using it as an interactive query layer directly with our applications, BI applications. This is our JavaScript-based BI applications, and Hadoop. So I think there are quite a few means to realize insights from Hadoop now. And that's the space where I see workload-specific engines coming in. >> And you mentioned earlier before we started that you were using Mahout, presumably for machine learning. And I guess I thought the center of gravity for that type of analytics has moved to Spark, and you haven't mentioned Spark yet. We are not using Mahout though. I mentioned it as something that's in that space. But yeah, I mean, Spark is pretty interesting. Spark SQL, doing ETL with Spark, as well as using Spark SQL for queries is something that looks very, very promising lately. >> Quick question for you, from a business perspective, so you're the Head of Engineering at GoDaddy. How do you interact with your business users? The C-suite, for example, where data science, machine learning, they understand, we have to have, they're embracing Hadoop more and more. They need to really, embracing big data and leveraging Hadoop as an enabler. What's the conversation like, or maybe even the influence of the GoDaddy business C-suite on engineering? How do you guys work collaboratively? >> So we do have very regular stakeholder meeting. And these are business stakeholders. So we have representatives from our marketing teams, finance, product teams, and data science team. We consider data science as one of our customers. We take requirements from them. We give them peek into the work we're doing. We also let them be part of our agile team so that when we have something released, they're the first ones looking at it and testing it. So they're very much part of the process. I don't think we can afford to just sit back and work on this monolithic data warehouse and at the end of the day say, "Hey, here is what we have" and ask them to go get the insights from it. So it's a very agile process, and they're very much part of it. >> One last question for you, sorry George, is, you guys mentioned you are sort of early in your partnership, unless I misunderstood. What has AtScale help GoDaddy achieve so far and what are your expectations, say the next six months? >> We want the world. (laughing) >> Lisa: Just that. >> Yeah, but the premise is, I mean, so Josh and I, we were part of the same team at Yahoo, where we faced problems that AtScale is trying to solve. So the premise of being able to solve those problems, which is, like their name, basically delivering data at scale, that's the premise that I'm very much looking forward to from them. >> Well, excellent. Well, we want to thank you both for joining us on theCUBE. We wish you the best of luck in attaining the world. (all laughing) >> Josh: There we go, thank you. >> Excellent, guys. Josh Klahr, thank you so much. >> My pleasure. Prashanthi, thank you for being on theCUBE for the first time. >> No problem. >> You've been watching theCUBE live at the day two of the DataWorks Summit. For my cohost George Gilbert, I am Lisa Martin. Stick around guys, we'll be right back. (jingle)

Published Date : Jun 14 2017

SUMMARY :

Brought to you by Hortonworks. George and I are excited to welcome our next guests. And we have Prashanthi Paty, Talk to us about what you both visualized and experienced And the other thing that we did was, and then we would scale it back, and the lack of proper file formats like Parquet? So we were dealing with Pig versions of like, the idea is that you could do full resolution And the other thing was the interface. How have you succeeded there, and solve the problem the way we wanted it to be solved, So how do you make, And so that's one thing we do, is, that's consumption ready for the business user? based on the design that you put in the cube. and kind of, where are you in your journey right now? So we want to be more agile with our schema changes. So that sounds like a classic business intelligence Josh: Yes. of data so that it's consumable, but you also wanted And I think as we start using them more and more, What are you seeing at AtScale and how are you guys and realizing more of the insights for the end users. Curious, can you elaborate a little more And we were using it as an interactive query layer and you haven't mentioned Spark yet. machine learning, they understand, we have to have, and at the end of the day say, "Hey, here is what we have" you guys mentioned you are sort of early We want the world. So the premise of being able to solve those problems, Well, we want to thank you both for joining us on theCUBE. Josh Klahr, thank you so much. for the first time. of the DataWorks Summit.

ENTITIES

Entity	Category	Confidence
Josh	PERSON	0.99+
George	PERSON	0.99+
Lisa Martin	PERSON	0.99+
George Gilbert	PERSON	0.99+
Josh Klahr	PERSON	0.99+
Prashanthi Paty	PERSON	0.99+
Prashanthi	PERSON	0.99+
Lisa	PERSON	0.99+
Yahoo	ORGANIZATION	0.99+
Kevin Bacon	PERSON	0.99+
San Jose	LOCATION	0.99+
Excel	TITLE	0.99+
Silicon Valley	LOCATION	0.99+
GoDaddy	ORGANIZATION	0.99+
40%	QUANTITY	0.99+
yesterday	DATE	0.99+
AtScale	ORGANIZATION	0.99+
tens	QUANTITY	0.99+
Spark	TITLE	0.99+
Druid	TITLE	0.99+
First job	QUANTITY	0.99+
Hadoop	TITLE	0.99+
two	QUANTITY	0.99+
Spark SQL	TITLE	0.99+
today	DATE	0.99+
two degrees	QUANTITY	0.99+
both	QUANTITY	0.98+
one	QUANTITY	0.98+
DataWorks Summit	EVENT	0.98+
two things	QUANTITY	0.98+
Elasticsearch	TITLE	0.98+
first time	QUANTITY	0.98+
DataWorks Summit 2017	EVENT	0.97+
first teams	QUANTITY	0.96+
Solr	TITLE	0.96+
Mahout	TITLE	0.95+
hundreds of tables	QUANTITY	0.95+
two main jobs	QUANTITY	0.94+
One last question	QUANTITY	0.94+
billions and	QUANTITY	0.94+
McKinsey	ORGANIZATION	0.94+
Day two	QUANTITY	0.94+
One	QUANTITY	0.94+
Parquet	TITLE	0.94+
Tableau	TITLE	0.93+

Bryan Duxbury, StreamSets | Spark Summit East 2017

>> Announcer: Live from Boston, Massachusetts. This is "The Cube" covering Spark Summit East 2017. Brought to you by Databricks. Now here are your hosts Dave Volante and George Gilbert. >> Welcome back to snowy Boston everybody. This is "The Cube." The leader in live tech coverage. This is Spark Summit. Spark Summit East #SparkSummit. Bryan Duxbury's here. He's the vice president of engineering at StreamSets. Cleveland boy! Welcome to "The Cube." >> Thanks for having me. >> You've very welcome. Tell us, let's start with StreamSets. We're going to talk about Spark and some of the use cases that it's enabling and some of the integrations you're doing. But what does StreamSets do? >> Sure, StreamSets is a data movement software. So I like to think of it either the first mile or the last mile of a lot of different analytical or data movement workflows. Basically we build a product that allows you to build a workflow, or build a data pipeline that doesn't require you to code. It's a graphical user interphase for dropping an origin, several destinations, and then lightweight transformations onto a canvas. You click play and it runs. So this is kind of different than, a lot of the market today is a programming tool or a command line tool. That still requires your systems engineers or your unfortunate data scientists pretending to be systems engineers to do systems engineering. To do a science project to figure out how to move data. The challenge of data movement I think is often underplayed how challenging it is. But it's extremely tedious work. You know, you have to connect to dozens or hundreds of different data sources. Totally different schemas. Different database drivers, or systems altogether. And it break all the time. So the home-built stuff is really challenging to keep online. When it goes down, your business is not, you're not moving data. You can't actually get the insights you built in the first place. >> I remember I broke into this industry you know, in the days of mainframe. You used to read about them and they had this high-speed data mover. And it was this key component. And it had to be integrated. It had to be able to move, back then, it was large amounts of data fast. Today especially with the advent of Hadoop, people say okay don't move the data, keep it in place. Now that's not always practical. So talk about the sort of business case for starting a company that basically moves data. >> We handle basically the one step before. I agree with you completely. Many data analytical situations today where you're doing like the true, like business-oriented detail, where you're actually analyzing data and producing value, you can do it in place. Which is to say in your cluster, in your Spark cluster, all the different environments you can imagine. The problem is that if it's not there already, then it's a pretty monumental effort to get it there. I think we see. You know a lot of people think oh I can just write a SQL script, right? And that works for the first two to 20 tables you want to deploy. But for instance, in my background, I used to work at Square. I ran a data platform there. We had 500 tables we had to move on a regular basis. Coupled with a whole variety of other data sources. So at some point it becomes really impractical to hand-code these solutions. And even when you build your own framework, and you start to build tools internally, you know, it's not your job really, these companies, to build a world class data movement tool. It's their job to make the data valuable, right? And actually data movement is like utility, right. Providing the utility, really the thing to do is be productive and cost effective, right? So the reason why we build StreamSets, the reason why this thing is a thing in the first place, is because we think people shouldn't be in the business of building data movement tools. They should be in the business of moving their data and then getting on with it. Does that make sense? >> Yeah absolutely. So talk about how it all fits in with Spark generally and specifically Spark coming to the enterprise. >> Well in terms of how StreamSets connects to stuff, we deploy in every way you can imagine, whether you want to run your own premise, on your own machines, or in the Cloud. It's up to you to deploy however you like. We're not prescriptive about that. We often get deployed on the edge of clusters, wether it's your Hadoop cluster or your Spark cluster. And basically we try not to get in the way of these analysis tools. There are many great analytical tools out there like Spark is a great example. We focus really on the moving of data. So what you'll see is someone will build a Spark streaming application or some big Spark SQL thing that actually produces the reports. And we plug in ahead of that. So if you're data is being collected from, you know, Edge web logs or some thing or some Kafka thing or a third party AVI or scripting website. We do the first collection. And then it's usually picked up from there with the next tool. Whether it's Spark or other things. I'm trying to think about the right way to put this. I think that people who write Spark they should focus on the part that's like the business value for them. They should be doing the thing that actually is applying the machine learning model, or is producing the report that the CEO or CTO wants to see. And move away from the ingest part of the business. Does that make sense? >> [] Yeah. >> Yeah. When the Spark guys sort of aspire to that by saying you don't have to worry about exactly when's delivery. And you know you can make sure this sort of guarantee, you've got guarantees that will get from point A to point B. >> Bryan: Yeah. >> Things like that. But all those sources of data and all those targets, writing all those adapters is, I mean, that's been a La Brea tar pit for many companies over time. >> In essence that is our business. I think that you touch on a good point. Spark can actually do some of these things right. There's not complete, but significant overlap in some cases. But the important difference is that Spark is a cluster tool for working with cluster data. And we're not going to beat you running a Spark application for consuming from Kafka to do your analysis. But you want to use Spark for reading local files? Do you want to use Spark for reading from a mainframe? Like these are things that StreamSets is built for. And that library of connectors you're talking about, it's our bread and butter. It's not your job as a data scientist, you know, applying Spark, to build a library of connectors. So actually the challenge is not the difficulty of building any one connector, because we have that down to an art now. But we can afford to invest, we can build a portfolio of connectors. But you as a user of Spark, can only afford to do it on demand. Reactive. And so that turn around time, of the cost it might take you to build that connector is pretty significant. And actually I often see the flow side. This is a problem I faced at Square, which was that people asked me to integrate new data sources, I had to say no. Because it was too rare, it was too unusual for what we had to do. We had other things to support. So the problem with that is that I have no idea what kind of opportunity cost I left behind. Like what kind of data we didn't get, kind of analysis we couldn't do. And with an approach like StreamSets, you can solve that problem sort of up front even. >> So sort of two follow ups. One is it would seem to be an evergreen effort to maintain the existing connectors. >> Bryan: Certainly. >> And two, is there a way to leverage connectors that others have built, like the Kafka connect type stuff. >> Truthfully we are a heavy-duty user of open source software so our actual product, if you dig in to what you see, it's a framework for executing pipelines. And it's for connecting other software into our product. So it's not like when we integrate Kafka we built a build brand new blue sky Kafka connector. We actually integrate what stuff is out there. So our idea is to bring as much of that stuff in there as we can. And really be part of the community. You know, our product is also open source. So we play well with the community. We have had people contribute connectors. People who say we love the product, we need it to connect to this other database. And then they do it for us. So it's been a pretty exciting situation. >> We were talking earlier off-camera, George and I have been talking all week about the badge workloads, interactive workloads, now you've got this sort of new emerging workloads, continuous screening workloads, which is in the name. What are you seeing there? And what kind of use cases is that enabling? >> So we're focused on mostly the continuous delivery workload. We also deliver the batch stuff. We're finding is people are moving farther and farther away from batch in general. Because batch was not the goal it was a means to the end. People wanted to get their data into their environment, so they could do their analysis. They want to run their daily reports, things like that. But ask any data scientist, they would rather the data show up immediately. So we're definitely seeing a lot of customers who want to do things like moving data live from a log file into Hadoop they can read immediately, in the order of minutes. We're trying to do our best to enable those kind of use cases. In particular we're seeing a lot of interest in the Spark arena, obviously that's kind of why we're here today. You know people want to add their event processing, or their aggregation, and analysis, like Spark, especially like Spark SQL. And they want that to be almost happening at the time of ingest. Not once it landed, but like when it's happening. So we're starting to build integration. We have kind of our foot in the door there, with our Spark processor. Which allows you to put a Spark workflow right in the middle of your data pipeline. Or as many of them as you want in fact. And we all sort of manage the lifecycle of that. And do all those connections as required to make your pipeline pretend to have a Spark processor in the middle. We really think that with that kind of workload, you can do your ingest, but you can also capture your real-time analytics along the way. And that doesn't replace batch reporting for say that'll happen after the fact. Our your daily reports or what have you. But it makes it that much easier for your data scientists to have, you know, a piece of intelligence that they had in flight. You know? >> I love talking to someone who's a practitioner now sort of working for a company that's selling technology. What do you see, from both perspectives, as Spark being good at? You know, what's the best fit? And what's it not good at? >> Well I think that Spark is following the arc of like Hadoop basically. It started out as infrastructure for engineers, for building really big scary things. But it's becoming more and more a productivity tool for analysts, data scientist, machine-learning experts. And we see that popping up all the time. And it's really exciting frankly, to think about these streaming analytics that can happen. These scoring machine-learning models. Really bringing a lot more power into the hands of these people who are not engineers. People who are much more focused on the semantic value of the data. And not the garbage in garbage out value of the data. >> You were talking before about it's really hard, data movement and the data's not always right. Data quality continues to be a challenge. >> Bryan: Yeah. >> Maybe comment on that. State the data quality and how the industry is dealing with that problem. >> It is hard, it is hard. I think that the traditional approach to data quality is to try and specify a quality up front. We take the opposite approach. We basically say that it's impossible to know that your data will be correct at all times. So we have what we call schema drift tools. So we try to go, we say like intent-driven approach. We're interacting with your data. Rather then a schema driven approach. So of course your data has an implicit schema as it's passing through the pipeline. Rather than saying, let's transform com three, we want you to use the name. We want you to be aware of what it is you're trying to actually change and affect. And the rest just kind of flows along with it. There's no magic bullet for every kind of data-quality issue or schema change that could possibly come into your pipeline. We try to do the best to make it easy for you to do effectively the best practice. The easiest thing that will survive the future, build robust data pipelines. This is one of the biggest challenges I think with like home-grown solutions. Is that it's really easy to build something that works. It's not easy to build something that works all the time. It's very easy to not imagine the edge cases. 'Cause it might take you a year until you've actually encountered you know, the first big problem. The real, the gotcha that you didn't consider when you were building your own thing. And those of us at StreamSets who have been in the industry and on the user side, we've had some of these experiences. So we're trying to export that knowledge in the product. >> Dave: Who do you guys sell to? >> Everybody. (laughing) We see a lot of success today with, we call it Hadoop replatforming. Which is people who are moving from their huge variety of data sources environment into like a Hadoop data-like kind of environment. Also Cloud, people are moving into the Cloud. The need a way for their data to get from wherever it is to where they want it to be. And certainly people could script these things manually. They could build their own tools for this. But it's just so much more productive to do it quickly in a UI. >> Is it an architect who's buying your product? Is it a developer? >> It's a variety. So I think our product resonates greatly with a developer. But also people who are higher up in the chain. People who are trying to design their whole topology. I think the thing I love to talk about is everyone, when they start on a data project, they sit down and they draw this beautiful diagram with boxes and arrows that says here's where the data's going to go. But a month later, it works, kind of, but it's never that thing. >> Dave: Yeah because the data is just everywhere. >> Exactly. And the reality is that what you have to do to make it work correctly within SLA guidelines and things like that is so not what you imagined. But then you can almost never go backwards. You can never say based on what I have, give me the box scenarios, because it's a systems analysis effort that no one has the time to engage in. But since StreamSets is actually instruments, every step of the pipeline, and we have a view into how all your pipelines actually fit together. We can give you that. We can just generate it. So we actually have a product. We've been talking about the StreamSet data collector which is the core like data movement product. We have like our enterprise edition, which is called the Dataflow Performance Manager, or DPM, It basically gives you a lot of collaboration and enterprise grade authentication. And access control, and the commander control features. So it aggregates your metrics across all your data collectors. It helps you visualize your topology. So people like your director of analytics, or your CIO, who want to know is everything okay? We have a dashboard for them now. And that's really powerful. It's a beautiful UI. And it's really a platform for us to build visualizations with more intelligence. That looks across your whole infrastructure. >> Dave: That's good. >> Yeah. And then the thing is this is strangely kind of unprecedented. Because, you know, again, the engineer who wants to build this himself would say, I could just deploy Graphite. And all of a sudden I've got graphs it's fine right. But they're missing the details. What about the systems that aren't under your control? What about the failure cases? All these things, these are the things we tackle. 'Cause it's our business we can afford to invest massively and make this a really first-class data engineering environment. >> Would it be fair to say that Kafka sort of as it exists today is just data movement built on a log, but that it doesn't do the analytics. And it doesn't really yet, maybe it's just beginning to do some of the monitoring you know, with a dashboard, or that's a statement of direction. Would it be fair to say that you can layer on top of that? Or you can substitute on top of it with all the analytics? And then when you want the really fancy analytic soup, you know, call out to Spark. >> Sure, I would say that for one thing we definitely want to stay out of the analytics base. We think there's many great analytics tools out there like Spark. We also are not a storage tool. In fact, we're kind of like, we're queue-like but we view ourselves more like, if there's a pipe and a pump, we're the pump. And Kafka is the pipe. I think that from like a monitoring perspective, we monitor Kafka indirectly. 'Cause if we know what's coming out, and we know what's going in later, we can give you the stats. And that's actually what's important. This is actually one of the challenges of having sort of a home-grown or disconnected solution, is that stitching together so you understand the end to end is extremely difficult. 'Cause if you have a relational database, and a Kafka, and a Hadoop, and a Spark job, sure you can monitor all those things. They all have their own UIs. But if you can't understand what the is on the whole system you're left like with four windows open trying to figure out where things connect. And it's just too difficult. >> So just on a sort of a positioning point of view for someone who's trying to make sense out of all the choices they have, to what extent would you call yourself a management framework for someone who's building these pipelines, whether from Scratch, or buying components. And to what extent is it, I guess, when you talk about a pump, that would be almost like the run time part of it. >> Bryan: Yeah, yeah. >> So you know there's a control plane and then there's a data plane. >> Bryan: Sure. >> What's the mix? >> Yeah well we do both for sure. I mean I would say that the data point for us is StreamSet's data collector. We move data, we physically move the data. We have our own internal pipeline execution engine. So it doesn't presuppose any other existing technologies, not dependent on Hadoop or Spark or Kafka or anything. You know to some degree data collector is also the control plane for small deployments. Because it does give you start to stop commanding control. Some metrics monitoring, things like that. Now, what people need to expand beyond the realm of single data collector, when they have enterprises with more than one business unit, or data center, or security zone, things like that. You don't just deploy one data collector, you deploy a bunch, dozens or hundreds. And in that case, that's where dataflow performance manager again comes in, as that control plane. Now dataflow performance manager has no data in it. It does not pass your actual business data. But it does again aggregate all of your metrics from all your data collectors and gives you a unified view across your whole enterprise. >> And one more follow-up along those lines. When you have a multi-vendor stack, or a multi-vendor pipeline. >> Bryan: Yeah. >> What gives you the meta view? >> Well we're at the ins and outs. We see the interfaces. So in theory if someone were to consume data out of Kafka do something right. Then there's another job later, like a Spark job. >> George: Yeah. >> So we don't automatic visibility for that. But our plan in the future is to expand as dataflow performance manager to take third party metric sources effectively. To broaden the view of your entire enterprise. >> You've got a bunch of stuff on your website here which is kind of interesting. Talking about some of the things we talked about. You know taming data drift is one of your papers. The silent killer of data integrity. And some other good resources. So just in sort of closing, how do we learn more? What would you suggest? >> Sure, yeah please visit the website. The product is open source and free to download. Data collector is free to download. I would encourage people to try it out. It's really easy to take for a spin. And if you love it you should check out our community. We have a very active Slack channel and Google group, which you can find from the website as well. And there's also a blog full of tutorials. >> Yeah well you're solving gnarly problems that a lot of companies just don't want to deal with. That's good thanks for doing the dirty work, we appreciate it. >> Yeah my pleasure. >> Alright Bryan thanks for coming on "The Cube." >> Thanks for having me. >> Good to see you. You're welcome. Keep right there buddy we'll be back with our next guest. This is "The Cube" we're live from Boston Spark Summit. Spark Summit East #SparkSummit right back. >> Narrator: Since the dawn.

Published Date : Feb 9 2017

SUMMARY :

Brought to you by Databricks. He's the vice president of engineering at StreamSets. and some of the integrations you're doing. And it break all the time. And it had to be integrated. all the different environments you can imagine. generally and specifically Spark coming to the enterprise. And move away from the ingest part of the business. When the Spark guys sort of aspire to that But all those sources of data and all those targets, of the cost it might take you to build that connector to maintain the existing connectors. like the Kafka connect type stuff. And really be part of the community. about the badge workloads, interactive workloads, We have kind of our foot in the door there, What do you see, from both perspectives, And not the garbage in garbage out value of the data. data movement and the data's not always right. and how the industry is dealing with that problem. The real, the gotcha that you didn't consider Also Cloud, people are moving into the Cloud. I think the thing I love to talk about is And the reality is that what you have to do What about the systems that aren't under your control? And then when you want the really fancy And Kafka is the pipe. to what extent would you call yourself So you know there's a control plane and gives you a unified view across your whole enterprise. When you have a multi-vendor stack, We see the interfaces. But our plan in the future is to expand Talking about some of the things we talked about. And if you love it you should check out our community. That's good thanks for doing the dirty work, Good to see you.

ENTITIES

Entity	Category	Confidence
Bryan	PERSON	0.99+
Dave	PERSON	0.99+
Dave Volante	PERSON	0.99+
George Gilbert	PERSON	0.99+
George	PERSON	0.99+
Bryan Duxbury	PERSON	0.99+
StreamSets	ORGANIZATION	0.99+
first mile	QUANTITY	0.99+
Boston, Massachusetts	LOCATION	0.99+
dozens	QUANTITY	0.99+
Spark	TITLE	0.99+
500 tables	QUANTITY	0.99+
first	QUANTITY	0.99+
Google	ORGANIZATION	0.99+
20 tables	QUANTITY	0.99+
Kafka	TITLE	0.99+
hundreds	QUANTITY	0.99+
One	QUANTITY	0.99+
more than one business unit	QUANTITY	0.98+
Boston	LOCATION	0.98+
a year	QUANTITY	0.98+
Spark SQL	TITLE	0.98+
today	DATE	0.98+
first collection	QUANTITY	0.98+
one	QUANTITY	0.98+
a month later	DATE	0.98+
both	QUANTITY	0.98+
two	QUANTITY	0.98+
SQL	TITLE	0.98+
StreamSets	TITLE	0.98+
Today	DATE	0.97+
Databricks	ORGANIZATION	0.97+
Spark Summit East	LOCATION	0.97+
one data collector	QUANTITY	0.97+
Boston Spark Summit	LOCATION	0.97+
Spark Summit East 2017	EVENT	0.97+
Spark Summit East	EVENT	0.96+
one step	QUANTITY	0.96+
Cleveland	LOCATION	0.95+
both perspectives	QUANTITY	0.95+
StreamSet	ORGANIZATION	0.95+
Slack	ORGANIZATION	0.95+
Square	ORGANIZATION	0.95+
Hadoop	TITLE	0.94+
four windows	QUANTITY	0.93+
first two	QUANTITY	0.93+
Spark Summit	EVENT	0.93+
single data collector	QUANTITY	0.92+

Arun Murthy, Hortonworks - Spark Summit East 2017 - #SparkSummit - #theCUBE

>> [Announcer] Live, from Boston, Massachusetts, it's the Cube, covering Spark Summit East 2017, brought to you by Data Breaks. Now, your host, Dave Alante and George Gilbert. >> Welcome back to snowy Boston everybody, this is The Cube, the leader in live tech coverage. Arun Murthy is here, he's the founder and vice president of engineering at Horton Works, father of YARN, can I call you that, godfather of YARN, is that fair, or? (laughs) Anyway. He's so, so modest. Welcome back to the Cube, it's great to see you. >> Pleasure to have you. >> Coming off the big keynote, (laughs) you ended the session this morning, so that was great. Glad you made it in to Boston, and uh, lot of talk about security and governance, you know we've been talking about that years, it feels like it's truly starting to come into the main stream Arun, so. >> Well I think it's just a reflection of what customers are doing with the tech now. Now, three, four years ago, a lot of it was pilots, a lot of it was, you know, people playing with the tech. But increasingly, it's about, you know, people actually applying stuff in production, having data, system of record, running workloads both on prem and on the cloud, cloud is sort of becoming more and more real at mainstream enterprises. So a lot of it means, as you take any of the examples today any interesting app will have some sort of real time data feed, it's probably coming out from a cell phone or sensor which means that data is actually not, in most cases not coming on prem, it's actually getting collected in a local cloud somewhere, it's just more cost effective, why would we put up 25 data centers if you don't have to, right? So then you got to connect that data, production data you have or customer data you have or data you might have purchased and then join them up, run some interesting analytics, do geobased real time threat detection, cyber security. A lot of it means that you need a common way to secure data, govern it, and that's where we see the action, I think it's a really good sign for the market and for the community that people are pushing on these dimensions of the broader, because, getting pushed in this dimension because it means that people are actually using it for real production work loads. >> Well in the early days of Hadoop you really didn't talk that much about cloud. >> Yeah. >> You know, and now, >> Absolutely. >> It's like, you know, duh, cloud. >> Yeah. >> It's everywhere, and of course the whole hybrid cloud thing comes into play, what are you seeing there, what are things you can do in a hybrid, you know, or on prem that you can't do in a public cloud and what's the dynamic look like? >> Well, it's definitely not an either or, right? So what we're seeing is increasingly interesting apps need data which are born in the cloud and they'll stay in the cloud, but they also need transactional data which stays on prem, you might have an EDW for example, right? >> Right. >> There's not a lot of, you know, people want to solve business problems and not just move data from one place to another, right? Or back from one place to another, so it's not interesting to move an EDW to the cloud, and similarly it's not interesting to bring your IOT data or sensor data back into on-prem, right? Just makes sense. So naturally what happens is, you know, at Hortonworks we talk of kinds of modern app or a modern data app, which means a modern data app has to spare, has to sort of, you know, it can pass both on-prem data and cloud data. >> Yeah, you talked about that in your keynote years ago. Furio said that the data is the new development kit. And now you're seeing the apps are just so dang rich, >> Exactly, exactly. >> And they have to span >> Absolutely. >> physical locations, >> Yeah. >> But then this whole thing of IOT comes up, we've been having a conversation on The Cube, last several Cubes of, okay, how much stays out, how much stays in, there's a lot of debates about that, there's reasons not to bring it in, but you talked today about some of the important stuff will come back. >> Yeah. >> So the way this is, this all is going to be, you know, there's a lot of data that should be born in the cloud and stay there, the IOT data, but then what will happen increasingly is, key summaries of the data will move back and forth, so key summaries of your EDW will move to the cloud, sometimes key summaries of your IOT data, you know, you want to do some sort of historical training in analytics, that will come back on-prem, so I think there's a bi-directional data movement, but it just won't be all the data, right? It'll be key interesting summaries of the data but not all of it. >> And a lot of times, people say well it doesn't matter where it lives, cloud should be an operating model, not a place where you put data or applications, and while that's true and we would agree with that, from a customer standpoint it matters in terms of performance and latency issues and cost and regulation, >> And security and governance. >> Yeah. >> Absolutely. >> You need to think those things through. >> Exactly, so I mean, so that's what we're focused on, to make sure that you have a common security and governance model regardless of where data is, so you can think of it as, infrastructure you own and infrastructure you lease. >> Right. >> Right? Now, the details matter of course, when you go to the cloud you lose S3 for example or ADLS from Microsoft, but you got to make sure that there's a common sort of security governance front and top of it, in front of it, as an example one of the things that, you know, in the open source community, Ranger's a really sort of key project right now from a security authorization and authentication standpoint. We've done a lot of work with our friends at Microsoft to make sure, you can actually now manage data in Wasabi which is their object store, data stream, natively with Ranger, so you can set a policy that says only Dave can access these files, you know, George can access these columns, that sort of stuff is natively done on the Microsoft platform thanks to the relationship we have with them. >> Right. >> So that's actually really interesting for the open source communities. So you've talked about sort of commodity storage at the bottom layer and even if they're different sort of interfaces and implementations, it's still commodity storage, and now what's really helpful to customers is that they have a common security model, >> Exactly. >> Authorization, authentication, >> Authentication, lineage prominence, >> Oh okay. >> You want to make sure all of these are common sources across. >> But you've mentioned off of the different data patterns, like the stuff that might be streaming in on the cloud, what, assuming you're not putting it into just a file system or an object store, and you want to sort of merge it with >> Yeah. >> Historical data, so what are some of the data stores other than the file system, in other words, newfangled databases to manage this sort of interaction? >> So I think what you're saying is, we certainly have the raw data, the raw data is going to line up in whatever cloud native storage, >> Yeah. >> It's going to be Amazon, Wasabi, ADLS, Google Storage. But then increasingly you want, so now the patterns change so you have raw data, you have some sort of an ETL process, what's interesting in the cloud is that even the process data or, if you take the unstructured raw data and structure it, that structured data also needs to live on the cloud platform, right? The reason that's important is because A, it's cheaper to use the native platform rather than set up your own database on top of it. The other one is you also want to take advantage of all the native sources that the cloud storage provides, so for example, linking your application. So automatically data in Wasabi, you know, if you can set up a policy and easily say this structured data stable that I have of which is a summary of all the IOT activity in the last 24 hours, you can, using the cloud provider's technologies you can actually make it show up easily in Europe, like you don't have to do any work, right? So increasingly what we Hortonworks focused a lot on is to make sure that we, all of the computer engines, whether it's Spark or Hive or, you know, or MapReduce, it doesn't really matter, they're all natively working on the cloud provider's storage platform. >> [George] Okay. >> Right, so, >> Okay. >> That's a really key consideration for us. >> And the follow up to that, you know, there's a bit of a misconception that Spark replaces Hadoop, but it actually can be a processing, a compute engine for, >> Yeah. >> That can compliment or replace some of the compute engines in Hadoop, help us frame, how you talk about it with your customers. >> For us it's really simple, like in the past, the only option you had on Hadoop to do any computation was MapReduce, that was, I started working in MapReduce 11 years ago, so as you can imagine, it's a pretty good run for any technology, right? Spark is definitely the interesting sort of engine for sort of the, anything from mission learning to ETL for data on top of Hadoop. But again, what we focus a lot on is to make sure that every time we bring in, so right now, when we started on HTP, the first on HTP had about nine open source projects literally just nine. Today, the last one we shipped was 2.5, HTP 2.5 had about 27 I think, like it's a huge sort of explosion, right? But the problem with that is not just that we have 27 projects, the problem is that you're going to make sure each of the 27 work with all the 26 others. >> It's a QA nightmare. >> Exactly. So that integration is really key, so same thing with Spark, we want to make sure you have security and YARN (mumbles), like you saw in the demo today, you can now run Spark SQL but also make sure you get low level (mumbles) masking, all of the enterprise capabilities that you need, and I was at a financial services three or four weeks ago in Chicago. Today, to do equivalent of what I showed today on demo, they need literally, they have a classic ADW, and they have to maintain anywhere between 1500 to 2500 views of the same database, that's a nightmare as you can imagine. Now the fact that you can do this on the raw data using whether it's Hive or Spark or Peg or MapReduce, it doesn't really matter, it's really key, and that's the thing we push to make sure things like YARN security work across all the stacks, all the open source techs. >> So that makes life better, a simplification use case if you will, >> Yeah. >> What are some of the other use cases that you're seeing things like Spark enable? >> Machine learning is a really big one. Increasingly, every product is going to have some, people call it, machine learning and AI and deep learning, there's a lot of techniques out there, but the key part is you want to build a predictive model, in the past (mumbles) everybody want to build a model and score what's happening in the real world against model, but equally important make sure the model gets updated as more data comes in on and actually as the model scores does get smaller over time. So that's something we see all over, so for example, even within our own product, it's not just us enabling this for the customer, for example at Hortonworks we have a product called SmartSense which allows you to optimize how people use Hadoop. Where the, what are the opportunities for you to explore deficiencies within your own Hadoop system, whether it's Spark or Hive, right? So we now put mesh learning into SmartSense. And show you that customers who are running queries like you are running, Mr. Customer X, other customers like you are tuning Hadoop this way, they're running this sort of config, they're using these sort of features in Hadoop. That allows us to actually make the product itself better all the way down the pipe. >> So you're improving the scoring algorithm or you're sort of replacing it with something better? >> What we're doing there is just helping them optimize their Hadoop deploys. >> Yep. >> Right? You know, configuration and tuning and kernel settings and network settings, we do that automatically with SmartSense. >> But the customer, you talked about scoring and trying to, >> Yeah. >> They're tuning that, improving that and increasing the probability of it's accuracy, or is it? >> It's both. >> Okay. >> So the thing is what they do is, you initially come with a hypothesis, you have some amount of data, right? I'm a big believer that over time, more data, you're better off spending more, getting more data into the system than to tune that algorithm financially, right? >> Interesting, okay. >> Right, so you know, for example, you know, talk to any of the big guys on Facebook because they'll do the same, what they'll say is it's much better to get, to spend your time getting 10x data to the system and improving the model rather than spending 10x the time and improving the model itself on day one. >> Yeah, but that's a key choice, because you got to >> Exactly. >> Spend money on doing either, >> One of them. >> And you're saying go for the data. >> Go for the data. >> At least now. >> Yeah, go for data, what happens is the good part of that is it's not just the model, it's the, what you got to really get through is the entire end to end flow. >> Yeah. >> All the way from data aggregation to ingestion to collection to scoring, all that aspect, you're better off sort of walking through the paces like building the entire end to end product rather than spending time in a silo trying to make a lot of change. >> We've talked to a lot of machine learning tool vendors, application vendors, and it seems like we got to the point with Big Data where we put it in a repository then we started doing better at curating it and understanding it then starting to do a little bit exploration with business intelligence, but with machine learning, we don't have something that does this end to end, you know, from acquiring the data, building the model to operationalizing it, where are we on that, who should we look to for that? >> It's definitely very early, I mean if you look at, even the EDW space, for example, what is EDW? EDW is ingestion, ETL, and then sort of fast query layer, Olap BI, on and on and on, right? So that's the full EDW flow, I don't think as a market, I mean, it's really early in this space, not only as an overall industry, we have that end to end sort of industrialized design concept, it's going to take time, but a lot of people are ahead, you know, the Google's a world ahead, over time a lot of people will catch up. >> We got to go, I wish we had more time, I had so many other questions for you but I know time is tight in our schedule, so thanks so much Arun, >> Appreciate it. For coming on, appreciate it, alright, keep right there everybody, we'll be back with our next guest, it's The Cube, we're live from Spark Summit East in Boston, right back. (upbeat music)

Published Date : Feb 9 2017

SUMMARY :

brought to you by Data Breaks. father of YARN, can I call you that, Glad you made it in to Boston, So a lot of it means, as you take any of the examples today you really didn't talk that has to sort of, you know, it can pass both on-prem data Yeah, you talked about that in your keynote years ago. but you talked today about some of the important stuff So the way this is, this all is going to be, you know, And security and You need to think those so that's what we're focused on, to make sure that you have as an example one of the things that, you know, in the open So that's actually really interesting for the open source You want to make sure all of these are common sources in the last 24 hours, you can, using the cloud provider's in Hadoop, help us frame, how you talk about it with like in the past, the only option you had on Hadoop all of the enterprise capabilities that you need, Where the, what are the opportunities for you to explore What we're doing there is just helping them optimize and network settings, we do that automatically for example, you know, talk to any of the big guys is it's not just the model, it's the, what you got to really like building the entire end to end product rather than but a lot of people are ahead, you know, the Google's everybody, we'll be back with our next guest, it's The Cube,

ENTITIES

Entity	Category	Confidence
Dave	PERSON	0.99+
George Gilbert	PERSON	0.99+
Dave Alante	PERSON	0.99+
Arun Murthy	PERSON	0.99+
Europe	LOCATION	0.99+
Microsoft	ORGANIZATION	0.99+
10x	QUANTITY	0.99+
Boston	LOCATION	0.99+
Chicago	LOCATION	0.99+
Amazon	ORGANIZATION	0.99+
George	PERSON	0.99+
Arun	PERSON	0.99+
Wasabi	ORGANIZATION	0.99+
25 data centers	QUANTITY	0.99+
Today	DATE	0.99+
Hadoop	TITLE	0.99+
Wasabi	LOCATION	0.99+
YARN	ORGANIZATION	0.99+
Facebook	ORGANIZATION	0.99+
ADLS	ORGANIZATION	0.99+
Hortonworks	ORGANIZATION	0.99+
Horton Works	ORGANIZATION	0.99+
today	DATE	0.99+
Data Breaks	ORGANIZATION	0.99+
1500	QUANTITY	0.98+
SmartSense	TITLE	0.98+
S3	TITLE	0.98+
Boston, Massachusetts	LOCATION	0.98+
One	QUANTITY	0.98+
27 projects	QUANTITY	0.98+
three	DATE	0.98+
Google	ORGANIZATION	0.98+
Furio	PERSON	0.98+
Spark	TITLE	0.98+
2500 views	QUANTITY	0.98+
first	QUANTITY	0.97+
Spark Summit East	LOCATION	0.97+
both	QUANTITY	0.97+
Spark SQL	TITLE	0.97+
Google Storage	ORGANIZATION	0.97+
26	QUANTITY	0.96+
Ranger	ORGANIZATION	0.96+
four weeks ago	DATE	0.95+
one	QUANTITY	0.94+
each	QUANTITY	0.94+
four years ago	DATE	0.94+
11 years ago	DATE	0.93+
27 work	QUANTITY	0.9+
MapReduce	TITLE	0.89+
Hive	TITLE	0.89+
this morning	DATE	0.88+
EDW	TITLE	0.88+
about nine open source	QUANTITY	0.88+
day one	QUANTITY	0.87+
nine	QUANTITY	0.86+
years	DATE	0.84+
Olap	TITLE	0.83+
Cube	ORGANIZATION	0.81+
a lot of data	QUANTITY	0.8+

Recommend Videos

Sentiment Analysis

AWS Comprehend

Search Results for Spark SQL: