Data Power Panel V3
(upbeat music)

>> The stampede to cloud and massive VC investments has led to the emergence of a new generation of object store based data lakes. And with them two important trends, actually three important trends. First, a new category that combines data lakes and data warehouses, aka the lakehouse, has emerged as a leading contender to be the data platform of the future. And this novelty touts the ability to address data engineering, data science, and data warehouse workloads on a single shared data platform. The other major trend we've seen is query engines and broader data fabric virtualization platforms have embraced NextGen data lakes as platforms for SQL-centric business intelligence workloads, reducing, or some even claim eliminating, the need for separate data warehouses. Pretty bold. However, cloud data warehouses have added complementary technologies to bridge the gaps with lakehouses. And the third is that many, if not most, customers embracing the so-called data fabric or data mesh architectures are looking at data lakes as a fundamental component of their strategies, and they're trying to evolve them to be more capable, hence the interest in lakehouse, but at the same time, they don't want to, or can't, abandon their data warehouse estate. As such, we see a battle royale brewing between cloud data warehouses and cloud lakehouses. Is it possible to do it all with one cloud-centric analytical data platform? Well, we're going to find out. My name is Dave Vellante and welcome to the data platforms power panel on theCUBE, our next episode in a series where we gather some of the industry's top analysts to talk about one of our favorite topics, data. In today's session, we'll discuss trends, emerging options, and the trade-offs of various approaches, and we'll name names. Joining us today are Sanjeev Mohan, who's the principal at SanjMo, Tony Baer, principal at dbInsight, and Doug Henschen, who is the vice president and principal analyst at Constellation Research. Guys, welcome back to theCUBE. Great to see you again.

>> Thanks, guys. Thank you.

>> Thank you.

>> So it's early June and we're gearing up with two major conferences. There are several database conferences, but two in particular that we're very interested in, Snowflake Summit and Databricks Data and AI Summit. Doug, let's start off with you, and then Tony and Sanjeev, if you could kindly weigh in. Where did this all start, Doug? The notion of lakehouse. And let's talk about what exactly we mean by lakehouse. Go ahead.

>> Yeah, well you nailed it in your intro. One platform to address BI, data science, data engineering, fewer platforms, less cost, less complexity, very compelling. You can credit Databricks for coining the term lakehouse back in 2020, but it's really a much older idea. You can go back to Cloudera introducing their Impala database in 2012. That was a database on top of Hadoop. And indeed in that last decade, by the middle of that last decade, there were several SQL on Hadoop products, open standards like Apache Drill. And at the same time, the database vendors were trying to respond to this interest in machine learning and data science, so the likes of Hudi and Vertica were adding SQL extensions to support data science. But then later in that decade, with the shift to cloud and object storage, you saw the vendors shift to this whole cloud and object storage idea. So you have, in the database camp, Snowflake introducing Snowpark to try to address the data science needs. They introduced that in 2020 and last year they announced support for Python.
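To make that concrete, here is a minimal, hedged sketch of the Snowpark DataFrame style Doug is referring to, assuming the snowflake-snowpark-python package; the connection details, table, and column names are entirely hypothetical. The point is that the operations compile to SQL and run inside the warehouse rather than pulling data out to the client:

```python
# Minimal Snowpark sketch: the DataFrame operations below are translated
# to SQL and executed inside Snowflake, not on the client.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Hypothetical connection parameters; real values come from your account.
session = Session.builder.configs({
    "account": "my_account",
    "user": "my_user",
    "password": "my_password",
    "warehouse": "my_wh",
    "database": "my_db",
    "schema": "public",
}).create()

orders = session.table("orders")  # hypothetical table
summary = (orders
           .filter(col("status") == "SHIPPED")
           .group_by("region")
           .agg(avg(col("amount")).alias("avg_order_amount")))
summary.show()  # triggers pushed-down SQL execution in the warehouse
```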
You also had Oracle and SAP jump on this lakehouse idea last year, supporting both the lake and warehouse from a single vendor, not necessarily quite a single platform. Google very recently also jumped on the bandwagon. And then you also mentioned the SQL engine camp, the Dremios, the Ahanas, the Starbursts, really doing two things: a fabric for distributed access to many data sources, but also very firmly planting that idea that you can just have the lake and we'll help you do the BI workloads on that. And then of course, the data lake camp, with the Databricks and Clouderas providing warehouse-style deployments on top of their lake platforms.

>> Okay, thanks, Doug. I'd be remiss, those of you who know me know that I typically write my own intros. This time my colleagues fed me a lot of that material. So thank you. You guys make it easy. But Tony, give us your thoughts on this intro.

>> Right. Well, I very much agree with both of you, which may not make for the most exciting television, in that it has been an evolution, just like Doug said. I mean, for instance, just to give an example, when Teradata bought Aster Data, it was initially seen as a hardware platform play. In the end, it was basically all those Aster functions that made a lot of big data analytics accessible to SQL. (clears throat) And so what I really see, just in a simpler or functional definition, is that the data lakehouse is really an attempt by the data lake folks to make the data lake friendlier territory to the SQL folks, and also to get into friendlier territory with all the data stewards, who are basically concerned about the sprawl and the lack of control and governance in the data lake. So it's really kind of a continuation of an ongoing trend. That being said, there's no action without counteraction. And of course, at the other end of the spectrum, we also see a lot of the data warehouses starting to add things like in-database machine learning. So they're certainly not surrendering without a fight. Again, as Doug was mentioning, this has been part of a continual blending of platforms that we've seen over the years, that we first saw in the Hadoop years with SQL on Hadoop and data warehouses starting to reach out to cloud storage, or I should say HDFS, and then with the cloud going cloud native and therefore trying to break the silos down even further.

>> Tony, thank you. And Sanjeev, data lakes, when we first heard about them, it was such a compelling name, and then we realized all the problems associated with them. So pick it up from there. What would you add to Doug and Tony?

>> I would say these are excellent points that Doug and Tony have brought to light. The concept of lakehouse was going on, to your point, Dave, a long time ago, long before the term was invented. For example, Uber was trying to do a mix of Hadoop and Vertica because what they really needed were transactional capabilities that Hadoop did not have. So they weren't calling it the lakehouse, they were using multiple technologies, but now they're able to collapse it into a single data store that we call lakehouse. Data lakes are excellent at batch processing large volumes of data, but they don't have the real-time capabilities such as change data capture, doing inserts and updates. So this is why lakehouse has become so important, because they give us these transactional capabilities.

>> Great. So I'm interested, the name is great, lakehouse.
The concept is powerful, but I get concerned that there's a lot of marketing hype behind it. So I want to examine that a bit deeper. How mature is the concept of lakehouse? Are there practical examples that really exist in the real world that are driving business results for practitioners? Tony, maybe you could kick that off.

>> Well, put it this way. I think what's interesting is that both data lakes and data warehouses each had to extend themselves. To believe the Databricks hype, this was just a natural extension of the data lake. In point of fact, Databricks had to go outside its core technology of Spark to make the lakehouse possible. And it's a very similar type of thing on the part of the data warehouse folks, in that they've had to go beyond SQL. In the case of Databricks, there have been a number of incremental improvements to Delta Lake, to basically make the table format more performant, for instance. But the other thing, I think the most dramatic change in all that, is in their SQL engine, and they had to essentially pretty much abandon Spark SQL because, in and of itself, Spark SQL is essentially a stopgap solution. And if they wanted to really address that crowd, they had to totally reinvent SQL, or at least their SQL engine. And so Databricks SQL is not Spark SQL, it is not Spark, it's basically SQL that's adapted to run in a Spark environment, but the underlying engine is C++, it's not Scala or anything like that. So Databricks had to take a major detour outside of its core platform to do this. So to answer your question, this is not mature because, even though the idea of blending platforms has been going on for well over a decade, I would say that the current iteration is still fairly immature. And in the cloud, I could see a further evolution of this, because if you think through cloud native architecture, where you're essentially abstracting compute from data, there is no reason why, if, let's say, you are dealing with the same data targets, say cloud object storage, you might not apportion the task to different compute engines. And so therefore you could have, for instance, let's say you're Google, you could have BigQuery perform the types of SQL analytics that would be associated with the data warehouse, and you could have BigQuery ML that does some in-database machine learning, but at the same time, for another part of the query, which might involve, let's say, some deep learning, just for example, you might go out to, let's say, the serverless Spark service, or Dataproc. And there's no reason why Google could not blend all those into a coherent offering that's basically all triggered through microservices. And I just gave Google as an example; you could generalize that to all the other cloud or third-party vendors. So I think we're still very early in the game in terms of maturity of data lakehouses.
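To make Tony's Google example slightly more tangible, here is a minimal, hedged sketch of in-database machine learning in that style, using the google-cloud-bigquery Python client to issue a BigQuery ML statement. The dataset, table, and column names are hypothetical; the point is simply that the model trains and scores where the data already lives:

```python
# Hypothetical sketch: train a model inside the warehouse via BigQuery ML,
# so no data is exported to a separate ML environment.
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials and project

client.query("""
    CREATE OR REPLACE MODEL `my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg',
             input_label_cols = ['churned']) AS
    SELECT * FROM `my_dataset.customer_features`
""").result()  # blocks until the training job completes

# Scoring also happens in-database, via ML.PREDICT.
rows = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                    TABLE `my_dataset.customer_features`)
""").result()
```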
>> Thanks, Tony. So Sanjeev, is this all hype? What are your thoughts?

>> It's not hype, but I completely agree, it's not mature yet. Lakehouses still have a lot of work to do, so what I'm now starting to see is that the world is dividing into two camps. On one hand, there are people who don't want to deal with the operational aspects of vast amounts of data. They are the ones who are going for BigQuery, Redshift, Snowflake, Synapse, and so on, because they want the platform to handle all the data modeling, access control, performance enhancements. But these are trade-offs. If you go with these platforms, then you are giving up on vendor neutrality. On the other side are those who have engineering skills. They want the independence. In other words, they don't want vendor lock-in. They want to transform their data into any number of use cases, especially data science, machine learning use cases. What they want is agility via open file formats using any compute engine. So why do I say lakehouses are not mature? Well, cloud data warehouses provide you an excellent user experience. That is the main reason why Snowflake took off. If you have thousands of tables, it takes minutes to get them started, uploaded into your warehouse, and to start experimenting. Table formats are resonating far more with the community than file formats. But once the cost of the cloud data warehouse goes up, then organizations start exploring lakehouses. But the problem is lakehouses still need to do a lot of work on metadata. Apache Hive was a fantastic first attempt at it. Even today, Apache Hive is still very strong, but it's all technical metadata and it has so many different restrictions. That's why we see Databricks investing into something called Unity Catalog. Hopefully we'll hear more about Unity Catalog at the end of the month. But there's a second problem I just want to mention, and that is lack of standards. All these open source vendors, they're running what I call ego projects. You see on LinkedIn, they're constantly battling with each other, but the end user doesn't care. The end user wants a problem to be solved. They want to use Trino, Dremio, Spark from EMR, Databricks, Ahana, DaaS, Flink, Athena. But the problem is that we don't have common standards.

>> Right. Thanks. So Doug, I worry sometimes. I mean, I look at the space, we've debated for years best of breed versus the full suite. You see AWS with whatever, 12-plus different data stores and different APIs and primitives. You've got Oracle putting everything into its database; it's actually done some interesting things with MySQL HeatWave, so maybe there are proof points there. But Snowflake, really good at data warehouse, simplifying data warehouse. Databricks, really good at making lakehouses actually more functional. Can one platform do it all?

>> Well, in a word, no, you can't be best of breed at all things. I think that was the upshot of the cogent analysis from Sanjeev there. The vendors coming out of the database tradition, they excel at the SQL. They're extending it into data science, but when it comes to unstructured data, data science, ML and AI, it's often a compromise. The data lake crowd, the Databricks and such, they've struggled to completely displace the data warehouse when it really gets to the tough SLAs; they acknowledge that there's still a role for the warehouse. Maybe you can size down the warehouse and offload some of the BI workloads, and maybe some of these SQL engines are good for ad hoc, minimizing data movement. But really, when you get to the deep service-level requirements, the high concurrency, the high query workloads, you end up creating something that's warehouse-like.

>> Where do you guys think this market is headed? What's going to take hold? Which projects are going to fade away? You've got some things in Apache projects like Hudi and Iceberg; where do they fit, Sanjeev?
Do you have any thoughts on that?

>> So thank you, Dave. I feel that table formats are starting to mature. There is a lot of work that's being done. We will not have a single product or single platform; we'll have a mixture. So I see a lot of Apache Iceberg in the news. Apache Iceberg is really innovating. Their focus is on the table format, but then Delta and Apache Hudi are doing a lot of deep engineering work. For example, how do you handle high concurrency when there are multiple writes going on? Do you version your Parquet files, or how do you do your upserts, basically? So different focus; at the end of the day, the end user will decide what is the right platform, but we are going to have multiple formats living with us for a long time.

>> Doug, is Iceberg, in your view, something that's going to address some of those gaps in standards that Sanjeev was talking about earlier?

>> Yeah, Delta Lake, Hudi, Iceberg, they all address this need for consistency and scalability. Delta Lake is open technically, but is it open in practice? I don't hear about Delta Lake anywhere but Databricks, and I'm hearing a lot of buzz about Apache Iceberg. End users want an open performance standard. And most recently, Google embraced Iceberg for its recent BigLake, their stab at supporting both lakes and warehouses on one conjoined platform.

>> And Tony, of course, you remember the early days of the sort of big data movement: MapR was the most closed, Hortonworks the most open, Cloudera in between. There was always this kind of contest as to who's the most open. Does that matter? Are we going to see a repeat of that here?

>> I think it's spheres of influence, and Doug very much was kind of referring to this. I would call it kind of like the MongoDB syndrome, and I'm talking about MongoDB before they changed their license: an open source project, but very much associated with MongoDB, which basically controlled most of the contributions and made the decisions. And I think Databricks has the same ironclad hold on Delta Lake; the market pretty much associates Delta Lake as the Databricks open source project. I mean, Iceberg is probably further advanced than Hudi in terms of mindshare. And so what I see that breaking down to is essentially the Databricks open source versus the everything-else open source, the community open source. So I see a very similar type of breakdown repeating itself here.
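To ground what these table formats actually add on top of raw Parquet files, Sanjeev's point about upserts and concurrent writes, here is a minimal, hedged sketch of a Delta Lake merge. It assumes a Spark session configured with the delta-spark package, and the table path and schema are hypothetical:

```python
# Minimal upsert sketch: MERGE is what a table format like Delta Lake
# (and similarly Iceberg or Hudi) layers over immutable Parquet files,
# with ACID guarantees addressing the concurrent-writer problem.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("upsert-sketch")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

updates = spark.createDataFrame(
    [(1001, "SHIPPED", 42.50), (1002, "NEW", 19.99)],
    ["order_id", "status", "amount"])

# Hypothetical path that assumes the Delta table already exists;
# in practice this is usually object storage (s3://...).
target = DeltaTable.forPath(spark, "/tmp/delta/orders")
(target.alias("t")
 .merge(updates.alias("u"), "t.order_id = u.order_id")
 .whenMatchedUpdateAll()      # rows that already exist get updated
 .whenNotMatchedInsertAll()   # brand-new rows get inserted
 .execute())
```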
>> So by the way, Mongo has a conference next week, another data platform, kind of not really relevant to this discussion, but in a sense it is, because there's been a lot of discussion on earnings calls these last couple of weeks about consumption and who's exposed. Obviously people are concerned about Snowflake's consumption model; Mongo is maybe less exposed because Atlas is prominent in the portfolio, blah, blah, blah. But I wanted to bring up the little bit of controversy that we saw come out of the Snowflake earnings call, where the Evercore analyst asked Frank Slootman about discretionary spend. And Frank basically said, look, we're not discretionary, we are deeply operationalized. Whereas he kind of poo-pooed the lakehouse, or the data lake, et cetera, saying, oh yeah, data scientists will pull files out and play with them, that's really not our business. Do any of you have comments on that? Help us sort through that controversy. Who wants to take that one?

>> Let's put it this way. The SQL folks are from Venus and the data scientists are from Mars. It really comes down to that sort of perception. The fact is that, traditionally with analytics, it was very SQL-oriented, and the quants were kind of off in their corner, where they're using SAS or where they're using Teradata. It's really a great leveler today, which is that, I mean, basically Python has become arguably one of the most popular programming languages, depending on what month you're looking at the TIOBE index. And of course, obviously SQL, as I tell the MongoDB folks, SQL is not going away. You have a large skills base out there. And so basically I see this breaking down to, essentially, each group is going to have its own natural preferences for its home turf. And the fact that, let's say, the Python and Scala folks are using Databricks does not make them any less operational or mission critical than the SQL folks.

>> Anybody else want to chime in on that one?

>> Yeah, I totally agree with that. Python support in Snowflake is very nascent, with all of Snowpark, all of the things outside of SQL; they're very much relying on partners too to make things possible and make data science possible. And it's very early days. I think the bottom line, what we're going to see, is each of these camps is going to keep working on doing better at the thing that they don't do today, or they're new to, but they're not going to nail it. They're not going to be best of breed on both sides. So the SQL-centric companies and shops are going to do more data science on their database-centric platform, the data-science-driven companies might be doing more BI on their lakes with those vendors, and the companies that have highly distributed data are going to add fabrics, and maybe offload more of their BI onto those engines, like Dremio and Starburst.

>> So I've asked you this before, but I'll ask you, Sanjeev, 'cause Snowflake and Databricks are such great examples, 'cause you have the data engineering crowd trying to go into data warehousing and you have the data warehousing guys trying to go into the lake territory. Snowflake has $5 billion on the balance sheet, and I've asked you before, I ask you again, doesn't there have to be a semantic layer between these two worlds? Does Snowflake go out and do M&A and maybe buy an AtScale or a Datameer? Or is that just sort of a band-aid? What are your thoughts on that, Sanjeev?

>> I think the semantic layer is the metadata. The business metadata is extremely important. At the end of the day, the business folks would rather go to the business metadata than have to figure out, for example, let's say I want to update somebody's email address, and we have a lot of overhead with data residency laws and all that. I want my platform to give me the business metadata so I can write my business logic without having to worry about which database, which location. So having that semantic layer is extremely important. In fact, now we are taking it to the next level. Now we are saying that it's not just a semantic layer, it's all my KPIs, all my calculations. So how can I make those calculations independent of the compute engine, independent of the BI tool, and make them fungible? So more disaggregation of the stack, but it gives us more best-of-breed products that the customers have to worry about.
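One way to picture what Sanjeev is describing, KPI definitions that stay fungible across engines and BI tools, is a declarative metric spec that compiles to whatever engine runs it. This is a toy, hypothetical sketch, not any particular vendor's semantic layer:

```python
# Toy semantic-layer sketch: the KPI is declared once, independent of
# any warehouse or BI tool, then rendered to SQL on demand.
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    table: str
    expression: str            # engine-agnostic aggregate expression
    dimensions: list = field(default_factory=list)

net_revenue = Metric(
    name="net_revenue",
    table="orders",
    expression="SUM(amount) - SUM(refund_amount)",
    dimensions=["region", "order_month"],
)

def compile_to_sql(metric: Metric, dimension: str) -> str:
    """Render for one SQL dialect; other backends could render differently."""
    if dimension not in metric.dimensions:
        raise ValueError(f"{dimension} is not a declared dimension")
    return (f"SELECT {dimension}, {metric.expression} AS {metric.name} "
            f"FROM {metric.table} GROUP BY {dimension}")

print(compile_to_sql(net_revenue, "region"))
```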
>> So I want to ask you about the stack, the modern data stack, if you will. We always talk about injecting machine intelligence, AI, into applications, making them more data-driven. But when you look at the application development stack, it's separate; the database tends to be separate from the data and analytics stack. Do those two worlds have to come together in the modern data world? And what does that look like organizationally?

>> So organizationally, and even technically, I think it is starting to happen. Microservices architecture was a first attempt to bring the application and the data world together, but they are fundamentally different things. For example, if an application crashes, that's horrible, but Kubernetes will self-heal and it'll bring the application back up. But if a database crashes and corrupts your data, we have a huge problem. So that's why they have traditionally been two different stacks. They are starting to come together, especially with data ops, for instance, versioning of the way we write business logic. It used to be that business logic was highly embedded into our database of choice, but now we are disaggregating that using GitHub, CI/CD, the whole DevOps toolchain. So data is catching up to the way applications are.

>> We also have translytical databases; that's a little bit of what the story is with MongoDB next week, with adding more analytical capabilities. But I think companies that talk about that are always careful to couch it as operational analytics, not the warehouse-level workloads. So we're making progress, but I think there's always going to be, or there will long be, a separate analytical data platform.

>> Until data mesh takes over. (all laughing) Not opening a can of worms.

>> Well, but wait, I know it's out of scope here, but wouldn't data mesh say, hey, to Doug's earlier point, take your best of breed. You can't be best of breed at everything; wouldn't data mesh advocate, data lakes, do your data lake thing, data warehouse, do your data warehouse thing, and then you're just a node on the mesh. (Tony laughs) Now you need separate data stores and you need separate teams.

>> To my point.

>> I think, I mean, put it this way. (laughs) Data mesh itself is a logical view of the world. The data mesh is not necessarily on the lake or on the warehouse. I think for me, the fear there is more in terms of the silos of governance that could happen, and the siloed views of the world, how we redefine it. And that's why, and I want to go back to something Sanjeev said, which is that it's going to be raising the importance of the semantic layer. Now, that opens a couple of Pandora's boxes here, which is, one, does Snowflake dare go into that space, or do they risk basically alienating their partner ecosystem, which is a key part of their whole appeal, which is best of breed? They're kind of in the same situation Informatica was in the early 2000s, when Informatica briefly flirted with analytic applications, realized that was not a good idea, and needed to double down on their core, which was data integration. The other thing, though, that raises the importance of, and this is where the best of breed comes in, is the data fabric. My contention is that, whether you employ data mesh practice or not, if you do employ data mesh, you need data fabric. If you deploy data fabric, you don't necessarily need to practice data mesh.
But data fabric, and admittedly it's a category that's still very poorly defined and evolving, but at its core, we're talking about a common metadata backplane, something that we used to talk about with master data management. This would be something more, what I would say, mutable, more evolving, basically using, let's say, machine learning, so that we don't have to predefine rules or predefine what the world looks like. So I think in the long run, what this really means is that, whichever way we implement, on whichever physical platform we implement, we need to all be speaking the same metadata language. And I think at the end of the day, regardless of whether it's a lake, warehouse, or lakehouse, we need common metadata.

>> Doug, can I come back to something you pointed out? You were talking about bringing analytic and transaction databases together, and you had talked about operationalizing those and the caution there. Educate me on MySQL HeatWave. I was surprised when Oracle put so much effort into that, and you may or may not be familiar with it, but a lot of folks have talked about that. Now, it's got nowhere in the market, no market share, but we've seen a lot of these benchmarks from Oracle. How real is that, bringing together those two worlds and eliminating ETL?

>> Yeah, I have to defer on that one. That's my colleague, Holger Mueller. He wrote the report on that. He's way deep on it and I'm not going to try to mimic him.

>> I wonder if that is something, how real that is, or if it's just Oracle marketing. Anybody have any thoughts on that?

>> I'm pretty familiar with HeatWave. It's essentially Oracle doing, I mean, there's kind of a parallel with what Google's doing with AlloyDB. It's an operational database that will have some embedded analytics. And it's also something which I expect to start seeing with MongoDB. And I think, basically, Doug and Sanjeev were kind of referring to this before, about the operational analytics that are embedded within an operational database. The idea here is that the last thing you want to do with an operational database is slow it down. So you're not going to be doing very complex deep learning or anything like that, but you might be doing things like classification, you might be doing some predictives. In other words, we've just concluded a transaction with this customer, but was it less than what we were expecting? What does that mean in terms of, is this customer likely to churn? I think we're going to be seeing a lot of that. And I think that's a lot of what MySQL HeatWave is all about. Whether Oracle has any presence in the market, now, it's still a pretty new announcement. But the other thing Oracle has to battle against (laughs) is that, even though they own MySQL and run the open source project, in terms of the actual commercial implementation, it's associated with everybody else. And the popular perception has been that MySQL has been basically kind of like a sidelight for Oracle. And so it's on Oracle's shoulders to prove that they're damn serious about it.

>> It's no coincidence that MariaDB was launched the day that Oracle acquired Sun. Sanjeev, I wonder if we could come back to a topic that we discussed earlier, which is this notion of consumption. Obviously Wall Street's very concerned about it; Snowflake dropped prices last week.
I've always felt like, hey, the consumption model is the right model. I can dial it down when I need to; of course, the street freaks out. What are your thoughts on just pricing, the consumption model? What's the right model for companies, for customers?

>> The consumption model is here to stay. What I would like to see, and I think is an ideal situation, and it actually plays into the lakehouse concept, is that I have my data in some open format, maybe it's Parquet or CSV or JSON or Avro, and I can bring whatever engine is the best engine for my workloads, bring it on, pay for consumption, and then shut it down. And by the way, that could be Cloudera. We don't talk about Cloudera very much, but it could be that one business unit wants to use Athena, another business unit wants to use some other engine, let's say Trino or Dremio. So every business unit is working on the same data set, see, that's critical, but that data set is maybe in their VPC, and they bring any compute engine, you pay for the use, shut it down. Then you're getting value and you're only paying for consumption. It's not like I left a cluster running by mistake, so there have to be guardrails. The reason FinOps is so big is because it's very easy for me to run a Cartesian join in the cloud and get a $10,000 bill.

>> It looks like it's been sort of a victim of its own success in some ways. They made it so easy to spin up single-node instances, multi-node instances. And back in the day, when compute was scarce and costly, those database engines optimized every last bit so they could get as much workload as possible out of every instance. Today, it's really easy to spin up a new node, a new multi-node cluster. So that freedom has meant many more nodes that aren't necessarily getting that utilization. So Snowflake has been doing a lot to add reporting, monitoring, dashboards around the utilization of all the nodes and multi-node instances that have spun up. And meanwhile, we're seeing some of the traditional on-prem databases that are moving into the cloud trying to offer that freedom. And I think they're going to have that same discovery, that the cost surprises are going to follow as they make it easy to spin up new instances.

>> Yeah, a lot of money went into this market over the last decade, separating compute from storage, moving to the cloud. I'm glad you mentioned Cloudera, Sanjeev, 'cause they got it all started, the kind of big data movement. We don't talk about them that much. Sometimes I wonder if it's because, when they merged Hortonworks and Cloudera, they dead-ended both platforms, but then they did invest in a more modern platform. But what's the future of Cloudera? What are you seeing out there?

>> Cloudera has a good product. I have to say, the problem in our space is that there are way too many companies, there's way too much noise. We are expecting the end users to parse it out, or we're expecting analyst firms to boil it down. So I think marketing becomes a big problem. As far as technology is concerned, I think Cloudera did turn themselves around, and Tony, I know you talk to them quite frequently. I think they have had quite a comprehensive offering for a long time, actually. They created Kudu, so they've got operational, they have Hadoop, they have an operational data warehouse, they've migrated to the cloud. They are in a hybrid multi-cloud environment. A lot of cloud data warehouses are not hybrid; they're only in the cloud.

>> Right.
>> I think where Cloudera has been the most successful has been in the transition to the cloud and the fact that they're giving their customers more on-ramps to it, more hybrid on-ramps. So I give them a lot of credit there. They also have been trying to position themselves as being the most price-friendly, in terms of, we will put more guardrails and governors on it. I mean, part of that could be spin, but on the other hand, they don't have the same vested interest in compute cycles as, say, AWS would have with EMR. That being said, yes, I think Cloudera's most powerful appeal, and it almost sounds in a way, I don't want to cast them as a legacy system, but the fact is they do have a huge landed legacy on-prem and still significant potential to land and expand that to the cloud. That being said, even though Cloudera is multifunction, I think it certainly has its strengths and weaknesses. And the fact is that, yes, Cloudera has an operational database, or an operational data store, kind of like the outgrowth of HBase, but Cloudera is still primarily known for the deep analytics. Nobody's going to buy Cloudera, or Cloudera Data Platform, strictly for the operational database. They may use it as an add-on, just in the same way that a lot of customers have used, let's say, Teradata to do some machine learning or, let's say, Snowflake to parse through JSON. Again, it's not an indictment or anything like that, but the fact is, obviously, they do have their strengths and their weaknesses. I think their greatest opportunity is with their existing base, because that base has a lot invested and vested. And the fact is they do have a hybrid path that a lot of the others lack.

>> And of course, being on the quarterly shot clock was not a good place to be under the microscope for Cloudera, and now they at least can refactor the business accordingly. I'm glad you mentioned hybrid too. We saw Snowflake, last month, did a deal with Dell whereby non-native Snowflake data could access on-prem object store from Dell. They announced a similar thing with Pure Storage. What do you guys make of that? Is that just... How significant will that be? Will customers actually do that? I think they're using either materialized views or external tables.

>> There are data-related and residency requirements. There are desires to have these platforms in your own data center. And finally they capitulated. I mean, Frank Slootman is famous for saying to be very focused, and earlier, not many months ago, they called going on-prem a distraction, but clearly there's enough demand, and certainly with government contracts and any company that has data residency requirements, it's a real need. So they finally addressed it.

>> Yeah, I'll bet dollars to donuts there was an EBC session and some big customer said, if you don't do this, we ain't doing business with you. And that was like, okay, we'll do it.

>> So Dave, I have to say, earlier on you had brought up this point, how Frank Slootman was poo-pooing data science workloads. On your show, about a year or so ago, he said, we are never going on-prem. He burnt that bridge. (Tony laughs) That was on your show.

>> I remember exactly the statement, because it was interesting. He said, we're never going to do the halfway house. And I think what he meant is, we're not going to bring the Snowflake architecture to run on-prem, because it defeats the elasticity of the cloud. So this was kind of a capitulation, in a way.
But I think it still preserves his original intent, sort of, I don't know.

>> The point here is that every vendor will poo-poo whatever they don't have until they do have it.

>> Yes.

>> And then it'll be like, oh, we are all in, we've always been doing this. We have always supported this, and now we are doing it better than others.

>> Look, it was the same type of shock wave that we felt when AWS, at the last moment at one of their re:Invents, said, oh, by the way, we're going to introduce Outposts. The analyst group is typically pre-briefed about a week or two ahead under NDA, and that was not part of it. They just casually dropped that in the analyst session. It's like, you could have heard the sound of lots of analysts changing their diapers at that point.

>> (laughs) I remember that. And props to Andy Jassy, who once, many times actually, told us, never say never when it comes to AWS. So guys, I know we've got to run, we've got some hard stops. Maybe you could each give us your final thoughts. Doug, start us off, and then--

>> Sure. Well, we've got the Snowflake Summit coming up. I'll be looking for customers that are really doing data science, that are really employing Python through Snowflake, through Snowpark. And then a couple weeks later, we've got Databricks with their Data and AI Summit in San Francisco. I'll be looking for customers that are really doing considerable BI workloads. Last year I did a market overview of this analytical data platform space, 14 vendors, eight of them claiming to support lakehouse, both sides of the camp. The top customer Databricks could cite was unnamed; it had 32 concurrent users doing 15,000 queries per hour. That's good, but it's not up to the most demanding BI SQL workloads. And they acknowledged that and said they need to keep working on that. Snowflake, asked for their biggest data science customer, cited Kabura: 400 terabytes, 8,500 users, 400,000 data engineering jobs per day. I took the data engineering jobs to be probably SQL-centric, ETL-style transformation work. So I want to see the real use of Python, how much Snowpark has grown as a way to support data science.

>> Great. Tony.

>> Actually, of all things, and certainly I'll also be looking for similar things to what Doug is saying, but, sort of out of left field, I'm interested to see what MongoDB is going to start to say about operational analytics, 'cause I mean, they're into this conquer-the-world strategy, we can be all things to all people. Okay, if that's the case, what's it going to be with, basically, putting in some inline analytics? What are you going to be doing with your query engine? So that's actually kind of an interesting thing we're looking for next week.

>> Great. Sanjeev.

>> So I'll be at MongoDB World, Snowflake, and Databricks, and very interested in seeing, but since Tony brought up MongoDB, I see that even the databases are shifting tremendously. They are addressing both the HTAP use case, online transactional and analytical. I'm also seeing that these databases started, let's say in the case of MySQL HeatWave, as relational, or in MongoDB as document, but now they've added graph, they've added time series, they've added geospatial, and they just keep adding more and more data structures, really making these databases multifunctional. So very interesting.

>> It gets back to our discussion of best of breed versus all in one.
And it's likely Mongo's path, or part of their strategy of course, is through developers. They're very developer-focused, so we'll be looking for that. And guys, I'll be there as well. I'm hoping that we maybe have some extra time on theCUBE, so please stop by and we can maybe chat a little bit. Guys, as always, fantastic. Thank you so much, Doug, Tony, Sanjeev, and let's do this again.

>> It's been a pleasure.

>> All right, and thank you for watching. This is Dave Vellante for theCUBE and the excellent analysts. We'll see you next time. (upbeat music)
Clemence W. Chee & Christoph Sawade, HelloFresh
(upbeat music)

>> Hello everyone. We're here at theCUBE Startup Showcase, made possible by AWS. Thanks so much for joining us today. You know, when Zhamak Dehghani was formulating her ideas around data mesh, she wasn't the only one thinking about decentralized data architectures. HelloFresh was going into hyper-growth mode and realized that, in order to support its scale, it needed to rethink how it thought about data. Like many companies that started in the early part of the last decade, HelloFresh relied on a monolithic data architecture, and the internal team had concerns about its ability to support continued innovation at high velocity. The company's data team began to think about the future and work backwards from a target architecture which possessed many principles of so-called data mesh, even though they didn't use that term specifically. The company is a strong example of an early but practical pioneer of data mesh. Now, there are many practitioners and stakeholders involved in evolving the company's data architecture, many of whom are listed here on this slide. Two are highlighted in red and joining us today. We're really excited to welcome to theCUBE Clemence Chee, who is the global senior director for data at HelloFresh, and Christoph Sawade, who's the global senior director of data, also of course at HelloFresh. Folks, welcome. Thanks so much for making some time today and sharing your story.

>> Thank you very much.

>> Thanks, Dave.

>> All right, let's start with HelloFresh. You guys are number one in the world in your field. You deliver hundreds of millions of meals each year to many, many millions of people around the globe. You're scaling. Christoph, tell us a little bit more about your company and its vision.

>> Yeah. Should I start, or Clemence, maybe you take over the first piece, because Clemence has actually been a director at HelloFresh longer.

>> Yeah, go ahead, Clemence.

>> I mean, yes, approximately six years ago I joined HelloFresh, and I didn't think the startup I was joining would eventually IPO. Just two years later, HelloFresh went public. And approximately three years and 10 months after HelloFresh was listed on the German stock exchange, which was just last week, HelloFresh was included in the DAX, Germany's leading stock market index, and that, to my mind, is a great, great milestone, and I'm really looking forward and very excited for the future of HelloFresh and also our data. The vision that we have is to become the world's leading food solution group, and there are a lot of attractive opportunities. So recently we launched and expanded in Norway; this was in July. And earlier this year, we launched the US brand Green Chef in the UK as well. We're committed to continuously launching in different geographies in the coming years and have a strong path ahead of us. With the acquisition of ready-to-eat companies like Factor in the US and the planned acquisition of Youfoodz in Australia, we are diversifying our offer, now reaching even more untapped customer segments and increasing our total addressable market. So by offering customers a growing range of different alternatives to shop for food and to consume meals, we are charging towards this vision and this goal to become the world's leading integrated food solutions group.
>> So maybe you guys could talk a little bit about your journey as a company, specifically as it relates to your data journey. I mean, you began as a startup, you had a basic architecture and, like everyone, you made extensive use of spreadsheets; you built a Hadoop-based system that started to grow. And when the company IPO'd, you really started to explode. So maybe describe that journey from a data perspective.

>> Yes, Dave. So HelloFresh, by approximately 2015, had evolved what amounts to a classical, centralized data management setup. We grew very organically over the years, and there were a lot of very smart people around the globe really building the company and building our infrastructure. This also means that there were a small number of internal and external data sources, and a centralized BI team with a number of people producing different reports, different dashboards and products for our executives, for example, or for different operations teams, to see the company's performance, and knowledge was transferred just by talking to each other in face-to-face conversations. And the people in the data warehouse team were considered the data wizards or the ETL wizards. Very classical challenges. And it was ETL that really dictated the style of data management, right? Our central data warehouse team then was responsible for different types of verticals in different domains, different geographies. And all this setup gave us, in the beginning, the flexibility to grow fast as a company in 2015.

>> Christoph, anything to add to that?

>> Yes, not explicitly to that one, but as Clemence said, this was kind of the setup that actually worked for us quite a while. And then in 2017, when HelloFresh went public, the company also grew rapidly. Just to give you an idea of how that looked, the tech departments actually increased from about 40 people to almost 300 engineers, and in the same way the business units, as Clemence has described, also grew sustainably. So we continued to launch HelloFresh in new countries, launched new brands like EveryPlate, and also acquired other brands, like we have Factor. And that shows also from a data perspective: the number of data requests that the central (mumbles) were getting became more and more, and also more and more complex. For the team, that meant they had a fairly high mental load; they had to get a very deep understanding of the business, and they also suffered a lot from this context switching back and forth. Essentially, they had to prioritize across requests from our physical product, our digital product, from the marketing perspective, and also from the central reporting teams. And in a nutshell, this was very hard for these people, and it altered the situation such that, let's say, the solutions that we had built were not really optimal. So, in a nutshell, the central function became a bottleneck and slowed down all the innovation of the company.

>> It's a classic case, isn't it? I mean, Clemence, you see the central team becomes a bottleneck, and so the lines of business, the marketing team, sales teams say, "Okay, we're going to take things into our own hands." And then of course IT and the technical team is called in later to clean up the mess. Maybe I'm overstating it, but that's a common situation, isn't it?
>> Yeah, this is exactly what happened, right? So we had a bottleneck, we had those central teams, there was always a bit of tension. Analytics teams in those business domains like marketing, supply chain, finance, HR, and so on then really started to build their own data solutions. At some point you have to get the ball rolling, right, and then continue the trajectory, which meant that the data pipelines didn't meet the engineering standards, and there was an increased need for maintenance and support from central teams. Hence, over time, the knowledge about those pipelines, and how to maintain a particular infrastructure, for example, left the company, such that most of those data assets and data sets turned into a huge debt, with decreasing data quality, decreasing trust, decreasing transparency. And this was an increasing challenge, where a majority of time was spent in meeting rooms to align on data quality, for example.

>> Yeah. And the point you were making, Christoph, about context switching, this is a point that Zhamak makes quite often: we've contextualized our operational systems, like our sales systems, our marketing systems, but not our data systems. So you're asking the data team, okay, be an expert in sales, be an expert in marketing, be an expert in logistics, be an expert in supply chain, and it's start, stop, start, stop. It's a paper-cut environment, and it's just not as productive. But the flip side of that is, when you think about a centralized organization, you think, hey, this is going to be a very efficient way, a cross-functional team to support the organization, but it's not necessarily the highest-velocity, most effective organizational structure.

>> Yeah, so I agree with that piece, up to a certain scale. A centralized function has a lot of advantages, right? There's one tool for everyone, with a dedicated kind of expert team to go to. However, if you actually would like to accelerate, specifically with that type of growth, you want to have autonomy in certain teams and move the teams, or let's say the data, to the experts in these teams. And this, as you have mentioned, increases mental load. You can either internally start splitting your team into different kinds of sub-teams focusing on different areas; however, that is then again just adding another piece where collaboration needs to happen, because the teams are external to each other. So why not bridge that gap immediately and actually move these teams end to end into the function themselves? So maybe just to continue what Clemence was saying, and this is actually where Clemence's and my journey started to become one joint journey: Clemence was coming from one of these teams who built their own solutions; I was basically heading the platform team, called the data warehouse team in those days. And in 2019, when (mumbles) became more and more serious, I would say, so more and more people had recognized that this model does not really scale, the leadership of the company came together and identified data as a key strategic asset. And what we mean by that is that, if we leverage it in an appropriate way, it gives us a unique competitive advantage, which could help us to support and actually fully automate our decision-making process across the entire value chain.
So what we're trying to do now, or what we are aiming for, is that HelloFresh is able to build data products that have a purpose. We're moving away from the idea that it's just a by-product: we have a purpose why we would like to collect this data; there's a clear business need behind that. And because it's so important for the company as a business, we also want to provide it as a trustworthy asset to the rest of the organization. We'd say this is the best customer experience, at least in a way that users can easily discover, understand, and securely access high-quality data.

>> Yeah. So, Clemence, when you see Zhamak's writing, you see, you know, she has the four pillars and the principles. As practitioners, you look at that and say, okay, hey, that's pretty good thinking, and now we have to apply it. And that's where the devil meets the details. So it's the four: decentralized data ownership; data as a product, which we'll talk about a little bit; self-serve, which you guys have spent a lot of time on; and, Clemence, your wheelhouse, which is governance and a federated governance model. And it's almost like, if you achieve the first two, then you have to solve for the second two; it almost creates new challenges. But maybe you could talk about that a little bit as to how it relates to HelloFresh.

>> Yes. So Christoph has mentioned that we identified kind of a challenge beforehand and asked, how can we actually decentralize and empower the different colleagues of ours? And we realized that it was more an organizational or a cultural change, and this is something that someone also mentioned; I think ThoughtWorks mentioned in one of the white papers that it's more of an organizational or a cultural impact. And we kicked off a phased reorganization, different phases we're currently still in the middle of, trying to unlock this data at scale. The idea was really moving away from ever-growing, complex matrix organizations or matrix setups, and splitting between two different things. One is the value creation: basically when people ask the question, what can we actually do, what should we do? This is value creation. And the how, which is capability building, and both are equal in authority. This actually creates a high urge for collaboration, and this collaboration breaks up the different silos that were built. And of course, this also includes different staffing needs for teams, staffing with more, let's say, data scientists or data engineers, data professionals, into those business domains, hence some more capability building.

>> Okay, go ahead. Sorry.

>> So back to Zhamak Dehghani: the idea also then crossed over when she published her papers in May 2019, and we thought, well, the four pillars that she described, decentralized data ownership, a data-as-a-product mindset, a self-service infrastructure, and, as you mentioned, federated computational governance, suited very much our thinking at that point in time, to reorganize the different teams. And this then led to not only an organizational restructure, but also a completely new approach to how we need to manage data.

>> Got it. Okay. So your business is exploding.
The data team was having to become domain experts in many areas, constantly context switching. As we said, people started to take things into their own hands. So again, we said, classic story, but you didn't let it get out of control, and that's important. And so we actually have a picture of kind of where you're going today, and it's evolved into this. Pat, if you could bring up the picture with the elephant, here we go. So I'll talk a little bit about the architecture. It doesn't show the spreadsheet era here, but Christoph, maybe you could talk about that. It does show the Hadoop monolith, which exists today. I think that's in a managed hosting service, but you preserved that piece of it. But if I understand it correctly, everything is evolving to the cloud; I think you're running a lot of this, or all of it, in AWS. Everybody's got their own data sources. You've got a data hub, which I think is enabled by a master catalog for discovery, and all this underlying technical infrastructure that is really not the focus of this conversation today. But the key here, if I understand correctly, is that these domains are autonomous, and that not only required technical thinking, but a really supportive organizational mindset, which we're going to talk about today. But Christoph, maybe you could address, you know, at a high level, some of the architectural evolution that you guys went through.

>> Yeah, sure. Maybe it's also a good summary of the entire history. So as you have mentioned, we started in the very beginning with a monolith on the operational plane. Actually, it wasn't just one monolith, it was two: one for the backend and one for the front end. And our analytical plane was essentially a couple of spreadsheets. And I think there's nothing wrong with spreadsheets: they allow you to store information, to transform data, to share this information, to visualize this data. But it's not actually separating concerns, right? Everything is in one single tool. And this means that it's obviously not scalable, right? You reach the point where this kind of data management in one tool reaches its limits. So what we started is, we created our data lake, as we have seen here, on Hadoop, and in the very beginning it actually very much reflected our operational model. On top of that, we used Impala as a data warehouse, but there was not really a distinction between what is our data warehouse and what is our data lake, as Impala was used as kind of an engine for both, to create the warehouse and the data lake construct itself. And this organic growth actually led to a situation, as I think is clear now, where we had the centralized model, and for all the domains there were really loose Kimball modeling standards and no uniformity. We used to actually build, in-house, a way of building materialized views that we used for the presentation layer. There was a lot of duplication of effort, and in the end, essentially, a growing maintenance and feedback burden on what we had built, which naturally, as you said, led to the lack of trust. And this basically was a starting point for us to understand, okay, how can we move away from this? And there are a lot of different things that we can discuss; apart from the organizational structure that we have set up here, we have the four pillars from Zhamak.
However, there's also the next question: how do we implement data products? What are the implications at that level? And that is something where we are still in progress. >> Got it. Okay. So I wonder if we could switch gears a little bit and talk about the organizational and cultural challenges that you faced. What were those conversations like? Let's dig into that; I want to get into governance as well. >> The conversations on the cultural change. Yes, we went through hyper growth over the last year, and obviously there were a lot of new joiners, a lot of very smart people joining the company, which meant collaboration got more difficult: time zone changes, different artifacts and recreated documentation flying around. We essentially had to rebuild parts of the company from scratch, and of course this created the tension I described before. But the most important part is that data has always been a very important factor at HelloFresh. We collected more and more of it and used it to improve the different key areas of our business. Even through organizational struggles, data always helped us grow through change. In the end, those decentralized teams in our local geographies started with solutions that served the business, which was very important; otherwise we wouldn't be where we are today. But they did violate best practices and standards. I always use the sports analogy, Dave. In any sport there are rules and regulations that need to be followed, defined by what I'll call the sports association: think of that as data governance and our compliance team. Then we add the players, who need to follow those rules and abide by them; that is what we call data management. The players, the professionals, also need to be trained and to understand the strategy and the rules before they can play; that is what I call data literacy. So we realized we needed to focus on helping our teams develop those capabilities and teach the standards for how work is done, to truly drive functional excellence in the different domains. One ambition of our data literacy program, for example, is to empower every employee at HelloFresh, everyone, to make the right data-informed decisions by providing data education that scales. That can take different forms: including data capabilities in the learning paths, for example, helping people create and deploy data products, and connecting data producers and data consumers to build a common understanding of each other's dependencies, which is important. With SLAs, SLOs, data contracts, and so on, people get more of a sense of ownership and responsibility. Of course, we have to define what ownership and responsibility mean, but we are teaching this to our colleagues via individual learning paths and helping them upskill to use the shared infrastructure and those self-service data applications. To summarize, we are still in this process of learning ourselves.
So learning never stops at HelloFresh, but we are really trying to make it as much fun as possible. In the end, we all know user behavior is changed through positive experience. So instead of massive training programs, endless courses and workshops that leave new joiners and colleagues confused and overwhelmed, we're applying gamification. We split certification into different levels that our colleagues can access, and they earn badges along the way, which simplifies the process of learning and keeps users engaged. And we see it in surveys: our employees value this gamification approach a lot and are even competing to collect those learning badges and become number one on the leaderboard. >> I love the gamification. We've seen it work so well in so many different industries, not the least of which is crypto. So, you've identified some of the process gaps that you saw. Sometimes I say, pave the cow path; you didn't try to force a new architecture into the legacy processes. You really had to rethink your approach to data management. What did that entail? >> Rethink the way of data management, 100%. Take the example of the industrial revolution, or a classical supply chain revolution: imagine you have been riding a horse your whole life, and suddenly you can operate a car, a completely new way of transporting assets from A to B. We needed to establish a new set of cross-functional business processes to run faster, drive faster, more robustly, and deliver data products that can be trusted and used by downstream processes and systems. Hence we had a set of new standards and procedures that fall into the internal data governance and compliance sector. By internal, I mean the data operations around new things like the data catalog: how to identify ownership, how to change ownership, how to certify data assets, everything around classical software development, which we now apply to data. This is old and new thinking at once: deployment, versioning, QA, ingestion policies, deletion procedures, all the things that software development has long been doing, we now do with data as well. In simple terms, it's a whole redesign of the supply chain of our data, with new procedures and processes in asset creation, asset management, and asset consumption. >> So data has become kind of the new development kit, if you will. I want to shift gears and talk about the notion of a data product, and we have a slide that we pulled from your deck. I'd like to unpack it a little bit; I'll just read it: "A data product is a product whose primary objective is to leverage data to solve customer problems, where customers are both internal and external." Pretty straightforward. I know you've gone much deeper in your thinking and into your organization, but how do you think about that, and how do you determine, for instance, who owns what? How did you get everybody to agree? >> I can take that one. Maybe let me start with what a data product is. I think that's an ongoing debate, and the debate itself is the important piece: it clarifies what we actually mean by a product and what the mindset actually is.
So from a definition perspective, I think we found the common denominator: a product is something that is important for the company and comes with value. What do we mean by that? It's a solution to a customer problem that delivers, ideally, maximum value to the business, and yes, it leverages the power of data. We have a couple of examples at HelloFresh: the historical, classical ones, such as dashboards to monitor our error rates, and more sophisticated ones, for example incorporating machine learning algorithms into our recipe recommendations. However, the important aspects of a data product are these. First, there is an owner: someone accountable for making sure that the product is actually served and maintained, and that it keeps delivering the value we are promising. Combined with that is the idea of proper documentation, like a product description, so that people understand how to use it and what it is about. Related to that is the idea of purpose: we need to ask ourselves, why does this thing exist? Does it provide the value we think it does? That then leads into a good understanding of the data product's life cycle. From creation onward, you need to collect feedback, learn from it, rework the product, and finally also think about when it is time to decommission it. So overall, the core of a data product is product thinking 101: the starting point needs to be the problem, not the solution.
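As an aside, the attributes Christoph lists here, an accountable owner, documentation, a stated purpose, and a life cycle, map naturally onto a machine-readable descriptor that tooling can enforce. The following is a minimal sketch of that idea, illustrative only; every name and field in it is hypothetical rather than HelloFresh's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class LifecycleStage(Enum):
    """Stages a data product moves through, from creation to retirement."""
    DRAFT = "draft"
    LIVE = "live"
    DEPRECATED = "deprecated"
    DECOMMISSIONED = "decommissioned"


@dataclass
class DataProduct:
    """Descriptor for a data product: owner, purpose, docs, life cycle."""
    name: str
    owner_team: str            # the domain team accountable for the product
    purpose: str               # why this product exists (the problem it solves)
    documentation_url: str     # product description consumers can discover
    stage: LifecycleStage = LifecycleStage.DRAFT
    consumers: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return the gaps that would block publishing this product."""
        gaps = []
        if not self.owner_team:
            gaps.append("no accountable owner")
        if not self.purpose:
            gaps.append("no stated purpose")
        if not self.documentation_url:
            gaps.append("no product documentation")
        return gaps


# Example: a recipe-recommendation product owned by a hypothetical domain team.
recs = DataProduct(
    name="recipe_recommendations",
    owner_team="menu-planning-domain",
    purpose="Rank recipes per customer to improve weekly box satisfaction",
    documentation_url="https://example.internal/docs/recipe-recommendations",
)
assert recs.validate() == []  # publishable: owner, purpose, and docs are set
```

A descriptor like this turns "every product has an owner and a purpose" from a slogan into a check a pipeline can enforce.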
And this is essentially what we saw was missing, and what brought us to the kind of data spaghetti we had built: in a rush, certain data assets were developed in isolation and continuously patched, just to fulfill ad hoc requests, without really understanding what the stakeholder needed. One result is duplication of effort, which is not just frustrating and probably not the most efficient way for a company to work; if I build the same data asset with slightly different assumptions, across multiple teams, it also leads to data inconsistency. Imagine the following scenario: from a management perspective, you ask a specific question and you get, from a couple of different teams, different kinds of graphs, different data and numbers, and in the end you do not know which ones to trust. You cannot tell whether you are observing noise or whether there is actually a signal you are looking for. The same goes for running an AB test: I have a new feature and would like to understand its business impact; I run the test against one data source, and in an unfortunate scenario the production system is running on a different source. You see different numbers, and what you saw in the AB test is not what you see in production. A typical thing. Then you ask some analytics team to do a deep dive to understand where the discrepancies come from, and in the worst case they use yet another source. In the end it's a pretty frustrating scenario, and a waste of the time of the people who have to identify the root cause of this divergence. In a nutshell, the highest degree of consistency is achieved when people reuse data assets. In the meetup talk we gave, we described how we started establishing this approach with AB testing: there is a team that owns the target metrics, associated with the business teams, and provides them as a product to other services, including the AB testing team. The AB testing team can pull that information through an interface, drawing on the metadata of an experiment, and after the assignment and data collection phases they can easily add a graph to a dashboard, just grouped by the AB testing variant. And we have seen this at other companies too, so it's not just a nice dream: elsewhere I have seen a complete KPI pipeline established that computed all of this information, hosted by one team, and used both for AB testing deep dives and for regular reporting. One last point on why I keep coming back to this: it requires that we treat data as a product. If multiple people are going to use the thing I own and build, I have to provide it as a trustworthy asset, in a way that is easy for people to discover and actually work with. >> Yeah. And coming back to that, this is why I get so excited about data mesh, because I really do think it's the right direction for organizations. When people hear "data product," they think, "Well, what does that mean?" But when you start to define it as you did, it's using data to add value: that could be cutting costs, generating revenue, or directly creating a product that you monetize. So it's somewhat in the eyes of the beholder. But the other point, and you made it earlier too, is context. When you have a centralized data team and all these P&L managers, a lot of times they'll question the data because they don't own it. If it doesn't agree with their agenda, they'll attack the data. But if they own the data, then they're responsible for defending it, and that's a mindset change that's really important. I'm curious how you got to that ownership. Was it top-down, with somebody providing leadership? Was it more organic, bottom-up? A combination? How did you decide who owned what? How did you get the business to take ownership of the data, and what does owning the data actually mean? >> That's a very good question, Dave. This is one of the pieces where we had a lot of learnings, and if you ask me where we would start over, I think it would be here: really thinking about how ownership should be approached. If a team has ownership, what does that mean?
It means the team has the responsibility to host the data assets themselves, to minimum acceptable standards, with minimal dependencies up- and downstream. Looking backwards, the interesting piece is that under that definition, the process we had to go through was, in most cases, not transferring ownership from a central team to other teams, but actually establishing ownership. I make that distinction because saying we "transfer" ownership would erroneously suggest the data sets were owned before. The platform team had the capability to make changes, and the analytics team understood the business use cases, but nobody truly owned the assets end to end, and establishing that ownership turned out to be expensive. So we went through this very lengthy process, and in the beginning we did it very naively: here's a document, here are all the data assets, who is the nearest neighbor who could take care of this one? And then we moved it over. But the problem is that a lot of this is technical debt: not properly documented, pretty unstable, built in a very inconsistent way over years, and the people who built it have already left the company. That's not something you want to hand someone, and people build up a certain resistance, even if they have genuinely bought into the idea of domain ownership. So if you ask me about the learnings: first, the company really needs to understand its core business concepts and maintain the mapping from each core business concept to the domain team that owns it, and then link that to the underlying data assets. On top of that comes an understanding of how to evolve the data assets and build new things in each domain, but also how to reduce technical debt and stabilize what we already have.
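The learning Christoph describes, mapping core business concepts to owning domain teams and then linking those to the underlying assets, lends itself to a simple automated audit. A hypothetical sketch, with invented concept, team, and asset names:

```python
# Map core business concepts to the domain teams that own them.
CONCEPT_OWNERS = {
    "customer": "crm-domain",
    "recipe": "menu-planning-domain",
    "delivery": "logistics-domain",
}

# Each data asset declares the business concept it belongs to.
ASSETS = [
    {"name": "dwh.customers_daily", "concept": "customer"},
    {"name": "dwh.recipe_ratings", "concept": "recipe"},
    {"name": "dwh.legacy_orders_v3", "concept": "order"},  # concept unowned!
]


def audit_ownership(assets, concept_owners):
    """Return assets whose business concept has no owning domain team."""
    return [a["name"] for a in assets if a["concept"] not in concept_owners]


orphans = audit_ownership(ASSETS, CONCEPT_OWNERS)
print(orphans)  # ['dwh.legacy_orders_v3'] -> ownership must be established
```

Run regularly, a check like this surfaces assets whose ownership still has to be established, rather than leaving them to be discovered when they break.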
>> Thank you for that, Christoph. I want to turn in a different direction and talk to Clemence about governance, which I know is an area you're passionate about. I pulled this slide from your deck, which I messed up a little bit, sorry for that; by the way, we're going to publish a link to the full video you guys did, so we'll share that with folks. Governance is one of the most challenging aspects of data mesh: if you're going to decentralize, you quickly realize this could be the wild west, as we talked about, all over again. So how are you approaching governance? There are a lot of items on this slide that underscore the complexity, whether it's privacy, compliance, et cetera. How did you approach this? >> It's about connecting those dots. The aim of the data governance program is to promote the autonomy of every team while still ensuring that everybody has the right interoperability. When we want to move from the wild west, riding horses, to a civilized way of transport, I can take the example of modern street traffic: all participants can maneuver independently, and as long as they follow the same rules and standards, everybody remains compatible, can understand and learn from each other, and we avoid car crashes. When I go from country to country, I understand what the street infrastructure means, how to drive my car, and how to read the traffic lights and the different signals. Likewise, HelloFresh as a business operates autonomously and consequently needs to follow the external and internal rules and standards of the jurisdictions in which we operate. In order to prevent a car crash, we need to at least ensure compliance with regulations, to account for society's and our customers' increasing concern with data protection and privacy. Teaching, advocating, and evangelizing this to everyone in the company was a key communication strategy. And as with data privacy and other external factors, the same goes for internal regulations and processes that help our colleagues adapt to this very new environment: the new way of thinking about and managing data implies that we need new processes and regulations for our colleagues as well. In a nutshell, data governance provides a framework for managing the people, processes, technology, and culture around our data traffic. For the program to be effective, that governance must come together to provide at least a common denominator, which is especially critical for shared data sets managed across our different geographies, for shared applications on shared infrastructure, and for what is consumed by centralized processes: master data, for example, and all the metrics and KPIs used for central steering. It's a big change, and our ultimate goal is non-invasive, federated, automated, computational governance. For that, we can't just talk about it; we have to go deep, use case by use case, PoC by PoC, and generate learnings together with the different teams. A classical approach: identify the target state, match it against the current state together with the business teams in the different domains, and run a risk assessment, for example, to increase transparency, because a lot of teams might not even know what kind of situation they are in. This is where training and the data literacy piece come into play: we go in and train based on the findings, starting from the most valuable use cases, and help our teams make this change and increase their capability, with, I wouldn't say hand-holding, but a lot of guidance. >> Can I quickly chime in on the governance piece? I think this is important: if we're talking about documentation, for example, yes, we can go from team to team and tell people, hey, you have to document your data assets in the data catalog, or you have to establish a data contract, and so on. But if we want to build data products at scale with actual governance, we need to think about automation. We need to think about what we can learn from engineering, and it starts with simple things. If we want to build up trust in our data products and apply the same rigor and best practices we know from engineering, there are things we can do, and we should think about what we can copy.
One example might be service level agreements, service level objectives, and service level indicators, as they exist at the engineering level when we provide services. The agreements represent the promises we make to our customers and consumers; the objectives are the internal targets that help us keep those promises; and the indicators are how we track how we're actually doing against them. That's just one example of where federated governance comes into play. In an ideal world, you would not just talk about data as a product, but about the data product as code: as much as possible, give the engineers the tools they are familiar with. Don't ask the product managers to document the data assets in the data catalog by hand; make it part of the configuration in the CI/CD continuous delivery pipeline, as we typically see for other engineering tasks and services. In that configuration we can cover PII, data quality monitoring, ingestion, the data catalog, and so on. Ideally, data products become a sort of template that can be deployed and is verified, or rejected, at build time, before we ever deploy it to production.
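To make the SLO idea concrete: a freshness promise ("this product is refreshed within N hours") can be tracked automatically against the corresponding indicator. The following is a minimal sketch, not HelloFresh's implementation; the product names and thresholds are invented, and a real pipeline would publish these results to monitoring rather than print them.

```python
from datetime import datetime, timedelta, timezone

# SLO: each data product promises a maximum staleness to its consumers.
FRESHNESS_SLO = {
    "dwh.customers_daily": timedelta(hours=24),
    "dwh.recipe_ratings": timedelta(hours=6),
}


def check_freshness(last_loaded: dict[str, datetime]) -> dict[str, bool]:
    """SLI vs SLO: is each product's latest load within its promised window?"""
    now = datetime.now(timezone.utc)
    return {
        product: (now - loaded_at) <= FRESHNESS_SLO[product]
        for product, loaded_at in last_loaded.items()
        if product in FRESHNESS_SLO
    }


# In practice the last-load timestamps would come from catalog or pipeline
# metadata; here they are hard-coded for the sketch.
status = check_freshness({
    "dwh.customers_daily": datetime.now(timezone.utc) - timedelta(hours=3),
    "dwh.recipe_ratings": datetime.now(timezone.utc) - timedelta(hours=9),
})
print(status)  # {'dwh.customers_daily': True, 'dwh.recipe_ratings': False}
```

The same pattern extends to other promises, completeness or schema stability, for example, so that a data contract is checked by machinery rather than by goodwill.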
>> Yeah, so it's like DevOps for data products. I'm envisioning almost a three-phase approach to governance, and it sounds like you're in the early phase of it, call it phase zero: there's learning, literacy, training, education, a kind of self-governance, some oversight, and a lot of manual work. Then you become process builders, then you codify it, and then you can automate it. Is that fair? >> I would rather think about automation as early as possible. Yes, there need to be some rules first, but then go use case by use case: is there some small piece we can already automate? If possible, roll that out, and extend it step by step. >> Is there a role, though, that adjudicates that? Is there a central chief data officer who's responsible for making sure people are complying, or how do you handle it? >> From a platform perspective, yes: we implement certain pieces that we consider important. However, that happens in close collaboration with the governance department; it's Clemence's piece to define the policies that need to be implemented. >> So, Clemence, essentially it's your responsibility to make sure that the policy is being followed, and then, as you were saying, Christoph, you want to compress the time to automation as much as possible. Is that right? >> Yeah. What needs to be really clear is that it's always a split effort; you can't just do one or the other. They go hand in hand, because to build the right engineering tooling we need transparency first. Code needs to be coded, so we have to operate on the same level, with the right understanding. So there are actually two things that are important: one is policies and guidelines, but equally important is to align with the end users, the tech teams, and engineering, and really bridge between the business teams and the engineering teams. >> Got it. So just a couple more questions, because we've got to wrap up. I want to talk a little bit about the business outcome; I know it's hard to quantify, and I'll talk about that in a moment. But major learnings: we've got some of the challenges that you cited, and I'll just put them up here; we don't have to go into detail, but I wanted to share them with folks. My question, and this is the advice-for-your-peers question: if you had to do it differently, if you had a do-over or a mulligan, as we like to say for you golfers, what would you do differently? >> Can I start with the transformational challenge? Understand that it carries a high load of cultural change. A deliberate communication strategy needs to be put in place, and people really need to be supported. You can't just go in and say, well, we have to change towards data mesh; it's human nature to be resistant to change, and change is uncomfortable. We need to take that away by training and by communicating. Chris, you might want to add something to that. >> Definitely. The point I've also made before: we need to acknowledge that data mesh is an architecture for scale, something needed by large companies that build data products at scale. Dave, you mentioned there are a lot of advantages to a centralized team, but at some point it may make sense to decentralize. And at that point you have to recognize that you're not building on a green field. A big learning, which is also reflected on the slide: don't underestimate your baggage. Typically you come to the point where the old model doesn't work anymore; at HelloFresh, we had lost trust in our data and saw a real risk of slowing down our innovation, and that triggered the need to change. At that transition you carry a lot of technical debt accumulated over years, and one thing we learned is that we probably decentralized some assets too early, without taking into account the maturity of the teams; we are investigating that and are now in the phase of correcting pieces of it. If you start this journey, you have to ask: are all my teams actually ready to take on this new capability? Make sure that, along with decentralization, you build up those capabilities in the teams, and, as Clemence mentioned, take the people with you on the journey. With that comes the knowledge gap: you need to think about hiring, about literacy, and about the technical debt I just talked about.
And the last piece I would add, which is not on the slide deck: from our perspective, we started on the analytical layer, because that is where things were exploding, where people felt the pain. But through the efforts we've made to modernize the current stack and data products towards data mesh, we've understood that it always comes down to the proper shape of our operational plane. We've been through a lot of pain, and the learning is that this really needs to be a commitment from the company; it needs to be end to end. >> I think that last point you made is so critical, because I hear a lot from the vendor community about how they're going to make analytics better, and that's not unimportant. But true data product thinking and decentralized data organizations really have to operationalize in order to scale. These decisions around data architecture and organization are fundamental and lasting; it's not necessarily about an individual project's ROI. There will be projects and sub-projects within this architecture, but the architectural decision itself is organizational and cultural: what's the best approach to support your business at scale? It really speaks to who you are as a company and how you operate, and getting that right, as we've seen in the success of data-driven companies, yields tremendous results. So I'll ask each of you to give us your final thoughts, and then we'll wrap. >> Can I quickly jump in on the target architecture you mentioned? When people talk about these pieces, they often have a layer diagram in mind: okay, there's an ingestion layer, a storage layer, a transformation layer, a presentation layer, and then we put a lot of technology on top of that, and that's our target architecture. However, what we really need are the different views: we need to understand the capabilities we need, how it should look and feel from the different personas' and experience view, and only then derive the target architecture from a technical perspective. Maybe, as an outlook on how we want to move forward: based on our strategy, we would like to increase data maturity across the entire company. It's a framework around the business strategy that breaks down into four pillars as well. People, meaning data culture, data literacy, the data organizational structure, and so on. Governance, as Clemence mentioned: compliance, data management, and so on. Technology, and we could talk for hours about that one: the data platform, the data science platform. And finally, enablement through data, meaning data quality, data accessibility, applied science, and data monetization. >> Great. Thank you, Christoph. Clemence, why don't you bring us home; give us your final thoughts. >> Okay.
I can only agree with Christoph that it's important to understand what maturity level people have, where the company, the people, the organization stand, and what kind of change really applies across those four pillars, and what needs to be tackled first. And that is not clear from the very beginning. At the green-field stage you come up with must-wins, with things you really want to do, out of theory and out of different white papers. Only when you start conducting the first initiatives do you understand whether those thoughts hold together, and where you are missing out on one of the four pillars: people, process, technology, and governance. Then iterate, step by step, small steps; don't boil the ocean. That way you are really able to identify the gaps and see where you can fill them, or where you have to increase maturity first, train people, or improve your tech stack. >> You know, HelloFresh is an excellent example of a company that is innovating. It was not born in Silicon Valley, which I love, and it's a global company. And I've got to ask you guys: it seems like an amazing place to work. Are you hiring? >> Yes, definitely, we do. As mentioned, we are hiring across the entire company, and specifically for data. There are a lot of open roles, so please visit our page: data engineering, data product management, and Clemence has a lot of roles you can speak to him about. But yes. >> Guys, thanks so much for sharing with theCUBE audience. You're pioneers, and we look forward to collaborations in the future to track progress. We really want to thank you for your time. >> Thank you very much. >> Thank you very much, Dave. >> And thank you for watching theCUBE's startup showcase, made possible by AWS. This is Dave Vellante. We'll see you next time. (cheerful music)
COMMUNICATIONS V1 | CLOUDERA
>> Hi, today I'm going to talk about network analytics and what that means for telecommunications as we go forward: thinking about 5G, the impact it's likely to have on network analytics, and the data required not just to run the network and understand it a little better, but also to inform the rest of the telecommunications business. So, as we think about where we are in network analytics: over the last 20 years, the telecommunications industry has evolved its management infrastructure to abstract away from the specific technologies in the network. What do we mean by that? Well, when the initial telecommunications networks were designed, management systems were built in; eventually fault management systems, assurance systems, provisioning systems, and so on were abstracted away. So it didn't matter what network technology you had, whether it was Nokia technology or Ericsson technology or Huawei technology or whatever it happened to be: you could just look at your fault management system and understand what faults had happened. Over the last 10 to 15 years, telecommunication service providers became more sophisticated in their approach to data analytics, and specifically network analytics, and started asking why and what-if questions about their network performance and network behavior. And so network analytics was born as a somewhat independent function, and over time more and more data was loaded into it. Today, just about every carrier in the world has a network analytics function that deals with vast quantities of data in big data environments, which are now being migrated to the cloud, as all telecommunications carriers migrate as many IT workloads as possible. So what is happening as we migrate to the cloud that drives enhancements in use cases and in scale for telecommunications network analytics? Well, 5G is the big thing, and 5G is not just another G. Yes, 5G means greater bandwidth, lower latency, and all those good things, so we can watch YouTube videos with less interference and less sluggish bandwidth. But 5G is really about enterprise services transformation. 5G is a more secure kind of network, but it is also a more pervasive network, with a fundamentally different network topology than previous generations. There are going to be more masts, and that means you can have more pervasive connectivity. So things like IoT and edge applications, autonomous cars, and smart cities are all much better served because you've got more masts; that, of course, also means you're going to have a lot more data, and we'll get to that. The second piece is immersive digital services. With more masts, more connectivity, lower latency, and higher bandwidth, the potential for services innovation is immense. And we don't know what those services are going to be yet; we know that technologies like augmented reality and virtual reality have great potential.
But we have yet to see where those commercial applications will be; the innovation potential of 5G is phenomenal. It certainly means we're going to have a lot more edge devices, and that again will increase the amount of data we have available. Then there's the idea of pervasive connectivity when it comes to smart cities, autonomous cars, and integrated traffic management systems: those kinds of smart environments thrive where you've got this pervasive connectivity, this persistent connection to the network. Again, that drives more innovation, and because you've got these new connected devices, you get even more data. This exponential rise in data is really what's driving the change in network analytics, and there are four major vectors behind the increase, in terms of both volume and speed. The first is more physical elements. As we said, 5G networks will have a different topology, with more devices and more masts. With more physical elements in the network, you're going to get more data coming off those physical networks, and it needs to be aggregated, collected, managed, stored, analyzed, and understood, so that we can better understand why things happen the way they do: why the network behaves the way it does, why the devices connected to the network behave the way they do, and ultimately why consumers, whether enterprises or retail customers, behave the way they do in their interactions with our edge nodes and devices. Second, we're going to have an explosion in the number of devices. We've already seen IoT devices, different kinds of trackers and sensors hanging off the edge of the network, whether to make buildings smarter, cars smarter, or people smarter in terms of having the measurements and the connectivity. The number of devices at the edge, and beyond the edge, is going to be phenomenal. One of the things we've wrestled with as an industry over the last few years is: where does the telco network end, and where does the enterprise, or even the consumer, network begin? It used to be very clear that the telco network ended at the router, but that's not so clear anymore, because in the enterprise space, particularly with virtualized networking, which we'll talk about in a second, you start to see end-to-end network services being deployed. In some instances those services are managed by the service provider themselves, and in some cases by the enterprise client; again, the line between where the telco network ends and where the enterprise or consumer network begins is not clear. So with the proliferation of devices at the edge, questions of what those devices are, what their data yield is, and what policies are needed to govern them, in terms of security, privacy, and the like, all become really important. Third, virtualized services; we just touched on that briefly.
One of the big trends happening right now is not just the shift of IT operations onto the cloud, but the shift of the network itself onto the cloud: the virtualization of network infrastructure. That has two major impacts. First, it means you get the agility and scale benefits of migrating workloads to the cloud, the elasticity, the growth, and all of that. But arguably more importantly for the telco, it means that with a virtualized network infrastructure you can offer entire networks to enterprise clients. If you're selling to a government department, for example, that is looking to stand up a system for export certification, something like that, you can sell them not just the connectivity but the networking and the infrastructure to serve that entire end-to-end application. In theory you could offer them an entire end-to-end communications network, and with 5G network slicing they can even have their own piece of the 5G bandwidth allocated to the carrier, a complete end-to-end environment. So the kinds of services telcos can offer on virtualized network infrastructure are many and varied, and it's an outstanding opportunity. But it also means that the number of network elements, virtualized in this case, is exploding, and the amount of data informing us about how those network elements are behaving and performing is going up as well. And then finally, AI complexity. On the demand side, network analytics and big data have historically been driven by returns on data monetization, whether through cost avoidance, service assurance, or revenue generation. But AI is transforming telecommunications, like every other industry, and the potential for autonomous operations is extremely attractive. Understanding how the end-to-end telecommunication service delivery infrastructure works is essential as a training ground for AI models that can help automate a huge number of telecommunications operating processes, so the AI demand for data is going through the roof. All of these things combine to mean that big data is exploding; it is absolutely going through the roof. So as telecommunications companies around the world look at their network analytics infrastructure, which was initially designed primarily for service assurance, and at how they migrate it to the cloud, these trends weigh on those decisions. You're not just taking a workload that used to run in the data center and making it run in the cloud; you're migrating a workload while expanding its use cases. And bear in mind that many of those workloads will need to remain on-premises, within a private cloud, or at best a hybrid cloud environment, to satisfy regulatory and jurisdictional requirements. So let's talk about an example. LG Uplus is a fantastic service provider in Korea, with huge growth in that business over the last 10 to 15 years.
Obviously most people will be familiar with LG, the electronics brand, maybe less so with LG Uplus, but they've been doing phenomenal work, and they were the first business in the world to launch commercial 5G, in 2019, a huge milestone. At the same time they deployed the network real-time analytics platform, or NRAP, from a combination of Cloudera and our partner Kamarck. There were a number of things driving the requirement for the analytics platform at the time. Clearly the 5G launch was the big one: they wanted visibility of services, service assurance, and service quality. What services have been launched? How are they being taken up? What issues are arising, where are the faults happening, where are the problems? When you launch a new service, you want to understand and be on top of the issues as they arise, so that was really important. The second piece, and this is not a new story to any telco in the world, was silos in operation: eliminating redundancies through the process of digital transformation. In particular, the two silos on the wired and wireless sides of the business had to come together so there would be an integrated network management system for LG Uplus as they rolled out 5G. Eliminating redundancy and driving cost savings through the integration of those silos was really important, and that's a process and people thing every bit as much as a systems and data thing. Another big driver: 5G brings huge opportunity for enterprise services innovation, and Industry 4.0 and digital experience use cases are very important in the South Korean market and in the business of LG Uplus. Related to that is applying AI to network management; a number of really exciting use cases have gone live at LG Uplus since the initial deployment, and they're making fantastic strides there. And the fourth driver was big data analytics for users across LG Uplus: the platform is not just for the immediate 5G application or support of the 5G network, but also for other data analysts and data scientists across the business. While network analytics' primary use case is network management, it has applications across the entire business: customer churn, next best offer, understanding customer experience and behavior, digital advertising, product innovation. All sorts of use cases and departments within the business needed access to this information, so collaboration and sharing across the real-time network analytics platform was very important.
And then finally, as I mentioned, the LG group is much bigger than just LG Uplus, because of the electronics business and other pieces, and it had launched a major group-wide digital transformation program in 2019; being part of that was also among the problems they were looking to address. So, first of all, the integration of wired and wireless data sources: getting the assurance data sources, the network data sources, and so on integrated was really important. Scale was massive for them: they're talking about billions of transactions processed in under a minute, and hundreds of terabytes per day, phenomenal scale that needed to be available out of the box, as it were. Then real-time indicators and alarms, with lots of KPIs and thresholds set to meet certain criteria and standards; customer-specific real-time analysis of 5G, particularly for the launch; and root cause analysis and AI-based prediction of service anomalies and issues as a core use case. There was also, as I've already discussed, the provision of data services across the organization, and support for understanding the business impact of 5G services, which was extremely important. It's not enough to understand that you have an outage in a particular network element; what is the impact on the business of LG Uplus, and what is the impact on the business of the customer, from an outage or an anomaly or a problem on the network? Being able to answer those kinds of questions was really important too. And as I said, between Cloudera and Kamarck, with LG Uplus themselves an intrinsic part of the solution, this is what we ended up building. It's a big, complicated architecture, and I really don't want to go into too much detail here; you can look at these things for yourself, but let me skip through it quickly. First, the key data sources: all of the wireless network information, plus other data sources. This is really important, because it sometimes gets skipped over: there were other systems in place, like the enterprise data warehouse, that needed to be integrated as well, along with southbound and northbound interfaces. We get data from the network and from network management applications through file interfaces; Kafka and NiFi are important technologies here; and the RDBMS systems, like the enterprise data warehouse, are able to feed into the platform. Northbound, we spoke already about making network analytics services available across the enterprise, so having both file and API interfaces available for other systems and other consumers is very important. There is a lot going on in the platform itself: two petabytes of persistent storage, with Cloudera HDFS on 300 nodes for raw data storage, and Kudu for real-time storage, supporting real-time indicator analysis, alarm generation, and other real-time processes.
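To give a feel for how the pieces named here fit together, below is a rough sketch of what one real-time indicator job on such a stack might look like: PySpark structured streaming reading events from Kafka, computing a per-element KPI over a short window, and flagging threshold breaches as alarms. The topic, schema, and threshold are invented for illustration, not LG Uplus's actual configuration.

```python
# Requires the spark-sql-kafka connector on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kpi-alarms").getOrCreate()

# Hypothetical event schema: one record per network-element measurement.
schema = (
    StructType()
    .add("element_id", StringType())
    .add("latency_ms", DoubleType())
    .add("event_time", TimestampType())
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "network-events")  # invented topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# KPI: average latency per element over a 1-minute window;
# raise an alarm when it breaches an (invented) 50 ms threshold.
alarms = (
    events.withWatermark("event_time", "2 minutes")
    .groupBy(F.window("event_time", "1 minute"), "element_id")
    .agg(F.avg("latency_ms").alias("avg_latency_ms"))
    .filter(F.col("avg_latency_ms") > 50.0)
)

query = (
    alarms.writeStream.outputMode("update")
    .format("console")  # for the sketch only; see note below
    .start()
)
query.awaitTermination()
```

In the platform described above, a job like this would write its alarms to Kudu or to an alarm topic for the northbound consumers, rather than to the console.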
>> So that was the core of the solution: Spark processes for ETL, key quality indicators, and alarming, plus a bunch of work around data preparation and data generation for transfer to third-party systems through the northbound interfaces; Impala API queries for real-time systems, there on the right-hand side; and then a whole set of clustering, classification, and prediction jobs through the machine learning processes, again a key use case where we've done a lot of work. I encourage you to look at the Cloudera website for more detail on some of the work we did here; it's some pretty cool stuff. And then finally, the upstream services; there are lots more than simply these, but service assurance is really important: service quality management, customer experience management, and autonomous controllers are really important consumers of the real-time analytics platform, along with conventional service assurance functions like fault and performance management. These things are as much consumers of the information in the network analytics platform as they are providers of data to it. As for specific use cases that have been stood up and are delivering value to this day, there are lots more, but these are just three that we pulled out. First, service-specific monitoring and customer quality analysis and response: growing from the initial 5G launch and then broadening into wider services, understanding where there are issues, so that when people complain, when people have an issue, we can answer the client's concerns in a substantive way. Second, AI functions around root cause analysis: understanding why things went wrong when they went wrong, and making recommendations on how to avoid those occurrences in the future, so we know what preventive measures can be taken. And finally, the collaboration function across LG Uplus, which continues to be important to this day, where data is shared throughout the enterprise through the API layer, through file interfaces, and through integrations with upstream systems. So that's a quick run-through of LG Uplus, and the numbers are just staggering. We've seen upwards of a billion transactions in under 40 seconds being tested, and we've gone beyond those thresholds already. And this isn't just a theoretical benchmarking exercise; with the proliferation of network infrastructure in the 5G context, with virtualized elements and all the other pieces driving massive volumes of data toward the network analytics platform, we're going to see these kinds of data volumes not too far down the track. Phenomenal scale. And this is just one example: we work with service providers all over the world, and over 80% of the top 100 telecommunication service providers run on Cloudera.
Those service providers use Cloudera in the network, and we're seeing those customers all migrating legacy platforms now onto CDP, the Cloudera Data Platform. They're increasing the jobs that they do — so it's not just warehousing, not just ingestion and ETL; they're moving into things like machine learning — and also looking at new data sources from places like NWDAF, the network data analytics function in 5G, or the management and orchestration layer in software-defined networks and network function virtualization. So new use cases coming in all the time, new data sources coming in all the time, growth in the application scope — as we say, from edge to AI. It's really exciting to see how the footprint is growing and how the applications in telecommunications are really making a difference in facilitating network transformation. And that's me covered for today. I hope you found that helpful. By all means, please reach out — there are a couple of links here. You can follow me on Twitter, you can connect to the telecommunications page, or reach out to me directly at Cloudera. I'd love to answer your questions and talk to you about how big data is transforming networks, and how network transformation is accelerating telcos throughout the world. >> I'm Jamie Sharath with Liga Data. I'm primarily on the delivery side of the house, but I also support our new business teams. I'd like to spend a minute telling you about Liga Data. We're basically a Silicon Valley startup, started in 2014, and our leadership, our executive team, were basically the data officers at Yahoo before this. We provide managed data services and products that are focused on telcos. We have some experience in non-telco industries, but our focus for the last seven years or so has been specifically on telco. We're something over 200 employees, with a global presence in North America, the Middle East, Africa, Asia, and Europe, and we have folks in all of those places. I'd like to call your attention to the middle of the screen there — here is where we have done some partnership with Cloudera. >> So if you look at that, you can see we're in Holland and Jamaica, and then a lot throughout Africa as well. Now, the data fabric is the product that we're talking about, and the data fabric is basically a big data type of data warehouse with a lot of additional functionality involved. The data fabric is comprised of something called Flare, which we'll talk about in a minute, below there, and then the Cloudera Data Platform underneath. So this is how we're partnering together: we have this tool, and it's functioning and delivering in something over ten opcos. So, Flare. Flare is the piece of it that is Liga Data IP; the rest is the Cloudera platform. What Flare does is it basically pulls in data and integrates it into an event streaming platform. It is the engine behind the data fabric. >> It's also a decisioning platform. So in real time we're able to pull in data, we're able to run analytics on it, and we're able to alert or do whatever is needed on a real-time basis. Of course, a lot of clients at this point are still sending data in batch, so it handles that as well. Next, let me paint a picture of Sacho. Sacho is a very interesting app — an AI analytics app for executives. It runs on your mobile phone.
It ties into your data — this could be the data fabric, but it could also be a standalone product. It basically allows you to ask human-type questions: say, how were my gross adds last week? How do they compare against the same time the week before, or even the same time 60 days ago? So as an executive or as an analyst, I can pull it up and look at it instantly, in a meeting or anywhere else, without having to think about queries or anything like that. >> So that's pretty much us at Liga Data. Now, to set the context of where we are: this is a traditional telco environment. You see the systems of record, you see the cloud, you see OSS and BSS data. One of the things that the next layer above — which we call the system of intelligence, the data fabric — does, is it merges that BSS and OSS data. So no longer do we have any silos or anything that's separated; it's all coming into one area to allow business users, or data scientists, to go in and work with it. If you look at the bottom layer of the system of intelligence, you can see that Flare is the tool that pulls in the data. It provides event streaming capabilities. It preserves entity states, so that you can go back and look at an entity's state at any time. >> It does stream analytics — that is, as the data is coming in, it can perform analytics on it — and it also allows real-time decisioning. That's something that business users can go in and create: a system of if-thens. It looks very much like a graph database, where you can create a rule that will act if a certain condition happens. So for instance, a bundle: a real-time offer, where a user is soon to run out of an ongoing bundle, and an offer can be sent to him right on the fly. And that's set up by the business user, as opposed to programmers — a sketch of that kind of rule follows below. >> On data infrastructure: the fabric really has three areas where data is persisted. Obviously there's the data lake. The data lake stores that level of granularity that is very deep — years and years of history. Data scientists like that, and for historical record keeping and requirements from the government, that data would be stored there. >> Then there's also something we call the business semantics layer, and the business semantics layer contains something over 650 specific telco KPIs. These are initially from TM Forum, but they also incorporate what we've built at the various mobile operators we've delivered at, and we've grown that. So that's there for the business; the data lake is there for data scientists. Analytical stores can be used for many different reasons — a lot of times RDBMSs are still there, so this Cloudera-based platform can tie into analytical data stores as well, via Flare. Access and reporting: graphic visualizations and APIs are a very key part of it, and third-party query tools — any kind of query tool — can be used. Those are, of course, the ones that are highly optimized and allow search of billions of records. >> And then if you look at the top, it's the systems of engagement — the ones you might call the use cases. So telco reporting, with hundreds of KPIs generated for users; segmentation, basically micro to macro segmentation — segmentation will play a key role in a use case we'll talk about in a minute; and monetization.
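Picking up the if-then decisioning described above, here is a minimal sketch of such a rule. The field names, segment label, and threshold are invented for illustration; this is not Flare's actual rule syntax.

```python
# Sketch: a business-user-style rule -- if a heavy-data subscriber is about to
# run out of bundle, push a top-up offer in real time.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubscriberEvent:
    msisdn: str
    segment: str
    data_balance_mb: float

def evaluate(event: SubscriberEvent) -> Optional[str]:
    if event.segment == "heavy_data" and event.data_balance_mb < 100:
        return f"send 1GB top-up offer to {event.msisdn}"
    return None  # no rule fired

# Example: a heavy-data user down to 62 MB triggers the offer.
print(evaluate(SubscriberEvent("27831234567", "heavy_data", 62.0)))
```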
>> On monetization: this helps telco providers monetize their data — how do they make money off of it, and how might they leverage this data to engage with another client? So for instance, in some markets where it's allowed, DPI is used, and the fabric tracks exactly where each person — we call him a subscriber — goes within his internet browsing on 4G or 5G, and all that data is stored. From that you can tell a lot of things: the segment, the profile in play, and their propensity to buy. Do they spend a lot of time on the Coca-Cola page? There are buyers out there that find that information very valuable. And then there's Sacho, which we spoke briefly about before, that sits on top of the fabric or stands alone. >> So the story we really want to tell is one case out of all of this — a CVM type of case. There was a mobile operator out there that was offering packages — whether a bundle or a particular tool — to subscribers, and they were offering a kind of broad approach that was not very focused. It was not based on the segments created from the profiling discussed earlier, the subscriber usage data was somewhat dated, and this was causing a lot of those offers to simply not be taken up. There were limited segmentation capabilities, really, before the fabric came in. Now, one of the key things about the fabric is that when you start building segments, you can build on that history. >> So all of that data stored in the data lake can be used in terms of segmentation — there's a sketch of that idea after this. So what did we do about the MNO's challenge? We basically put the data fabric in, running on the Cloudera Data Platform — that's how we team up. We facilitated the ability to personalize campaigns. What that means is, with the segments that were built, when a user fell within a segment we knew what his behavior most likely was, so those recommendations, those offers, could be created then — and we enabled this in real time, including the real-time ability to go out to the CRM system and gather further information. All of these tools, again, were running on top of the Cloudera Data Platform. What was the outcome? Well, the outcome was a much more precise offer given to the client — one that was accepted — an increase in cross-sell and up-sell, and subscriber retention. Our client came back to us and pointed out a 183% year-on-year revenue increase. So this is probably one of the key use cases. Now, one thing to mention is that there are hundreds and hundreds of use cases running on the fabric — I would even say thousands. A lot of those have been migrated: when the fabric is deployed, when we bring the Cloudera and Liga Data solution in, there's generally a legacy system that has many use cases, and many of those — virtually all of them — were migrated and put onto the platform. Another thing is that new use cases are enabled.
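Here is the segmentation idea in miniature — deriving a segment from a long usage history rather than a single recent snapshot, then matching an offer to it. Subscriber numbers, thresholds, and offer names are all illustrative assumptions.

```python
# Sketch: segment subscribers on 24 months of history, then pick an offer.
from statistics import mean

history = {  # monthly data usage in GB, oldest to newest (invented values)
    "27831111111": [1.2, 1.4, 1.1, 1.3] * 6,
    "27832222222": [9.8, 11.2, 10.4, 12.0] * 6,
}

def segment(monthly_gb):
    # Years of history smooth out one unusual month, unlike a 30-day snapshot.
    return "heavy_data" if mean(monthly_gb) > 5.0 else "light_data"

offers = {"heavy_data": "10GB_BUNDLE", "light_data": "1GB_BUNDLE"}

for msisdn, usage in history.items():
    seg = segment(usage)
    print(msisdn, seg, "->", offers[seg])
```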
>> When you get this level of granularity, and campaigns can base their offers on years of history as opposed to 30 days of history, the campaign management and response systems are enabled to be far more precise in their offers. >> Okay, so this is a technical slide. One of the things we normally do when we're out there talking to folks is give an overview for the first little while, and then a deep technical dive on all aspects of it. Sometimes that deep dive can go a couple of hours; I'm going to do this slide in a couple of minutes. If you look at it, you can see over on the left the sources of the data. They go through this tool called Flare, which runs on the Cloudera Data Platform. That can be via queues — real-time queues — via a landing zone, or via data extraction. You can take a look at the data quality there; those checks are built in. One of the things Flare does is that it has out-of-the-box ability to ingest data sources and to apply data quality and validation for telco-type sources — a toy version of those checks appears at the end of this segment. >> One of the reasons this is fast to market is that throughout those 10 or 12 opcos we've done with Cloudera, we have already built models — models for CCN, for AIR, for most mediation systems. So there's not going to be a type of input that we haven't already seen, or only very rarely, and that actually speeds up deployment very quickly. Then Flare does the transformations and the metrics, continuous learning — we call it continuous decisioning — and API access; for faster response we use a distributed cache. I'm not going to go too deeply in there, but Flare and the business semantics layer, again, are sitting on top of the Cloudera Data Platform. You see the Kafka queue on the right as well. >> And all of that together we're calling the fabric. So the fabric is the Cloudera Data Platform and Flare, and all of this runs together — and, by the way, there have been many, many hundreds of hours testing Flare with Cloudera through the whole process. The results? Well, there are four I'm going to talk about. We saw one for a CVM-type use case called My Pocket, where the subscribers of that mobile operator were 14 million plus. There was a use case at a 24-million-plus subscriber operator where year-on-year revenue was up 130%; a 32-million-plus operator at 38%; and then 44% at a telco with 76 million subscribers. These are different CVM-type use cases as well as network use cases. There are a lot more use cases we could talk about, but these are the ones we're looking at here — and again, that 183% is something we find consistently. These figures come from our actual end clients. How do you unlock the full potential of this? Well, I think the start is to arrange a meeting — it would be great for you to reach out to me or to Anthony. We're working in conjunction on this, and we can set up an initial meeting and go through it. That's the very beginning. Again, you can get additional information from the Cloudera website and from the Liga Data website. Anthony, that's the story. Thank you.
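As a toy version of the ingestion-time data quality and validation just described for telco-type sources, the sketch below checks a couple of fields on a made-up CDR feed. The layout and rules are assumptions for illustration, not Liga Data's actual models.

```python
# Sketch: out-of-the-box-style validation rules applied to an incoming CDR feed.
import csv
from io import StringIO

RULES = {
    "msisdn": lambda v: v.isdigit() and 9 <= len(v) <= 15,
    "duration_s": lambda v: v.isdigit() and int(v) >= 0,
    "cell_id": lambda v: bool(v),
}

# Two sample records: one clean, one failing every rule.
sample_cdr = StringIO("msisdn,duration_s,cell_id\n27831234567,120,cell-009\nBADNUM,-3,\n")

good, bad = [], []
for row in csv.DictReader(sample_cdr):
    failures = [field for field, ok in RULES.items() if not ok(row[field])]
    (bad if failures else good).append((row, failures))

print(f"{len(good)} valid, {len(bad)} rejected")
for row, failures in bad:
    print("rejected:", row, "failed:", failures)
```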
>> No, that's great. Jamie, thank you so much. It's wonderful to go deep, and I know that there are hundreds of use cases being deployed in MTN, but it's great to go deep on one. And like you said, once you get that sort of architecture in place, you can do so many different things. The power of data is tremendous, but it's great to be able to see how you can track it end to end — from collecting the data, processing it, understanding it, and then applying it in a commercial context and bringing actual revenue back into the business. So there's your ROI straight away, and now you've got a platform that you can transform your business on. It's a tremendous story, Jamie, and thank you for your part. >> Sure. >> That's our story for today. Like Jamie says, please do feel free to reach out to us. The website addresses are there, along with our contact details, and we'd be delighted to talk to you a little bit more about some of the other use cases, perhaps, and maybe about your own business, and how we might be able to make it perform a little better. So thank you.
SUMMARY :
It didn't matter what network technology operators had — Nokia or Ericsson — the cloud is driving enhancements in use cases, and that in turn increases the amount of data available. More physical elements, and huge numbers of devices at and beyond the edge, mean data that needs to be aggregated, collected, managed, and stored, with the agility and scale benefits of migrating to cloud — often within a private or, at best, hybrid cloud environment. LG Uplus, after huge growth in its business over the last 10 to 15 years, needed to integrate the two silos of wired and wireless data sources into a real-time network analytics platform: key data sources including wireless network information and RDBMS systems such as the enterprise data warehouse, upstream functions that are as much consumers of the network analytics platform as they are providers of data to it, upwards of a billion transactions in under 40 seconds, and new sources like the NWDAF network data analytics function in 5G. Liga Data's partnership with Cloudera spans Holland, Jamaica, and much of Africa; its data fabric — Flare on the Cloudera Data Platform — removes the silos between BSS and OSS data, stores years of history in the data lake for data scientists and for government record-keeping requirements, enables segmentation and campaigns based on years rather than 30 days of history, tracks the segment and profile of each subscriber and their propensity to buy, and powers Sacho questions such as comparing gross adds against the same time 60 days ago.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jamie | PERSON | 0.99+ |
Jeremy | PERSON | 0.99+ |
Holland | LOCATION | 0.99+ |
Jamie Sharath | PERSON | 0.99+ |
Anthony | PERSON | 0.99+ |
Korea | LOCATION | 0.99+ |
38% | QUANTITY | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
2014 | DATE | 0.99+ |
2019 | DATE | 0.99+ |
183% | QUANTITY | 0.99+ |
Europe | LOCATION | 0.99+ |
24 million | QUANTITY | 0.99+ |
14 million | QUANTITY | 0.99+ |
LG | ORGANIZATION | 0.99+ |
second piece | QUANTITY | 0.99+ |
30 days | QUANTITY | 0.99+ |
Jamaica | LOCATION | 0.99+ |
Nokia | ORGANIZATION | 0.99+ |
Huawei | ORGANIZATION | 0.99+ |
today | DATE | 0.99+ |
Yahoo | ORGANIZATION | 0.99+ |
130% | QUANTITY | 0.99+ |
32 million | QUANTITY | 0.99+ |
Asia | LOCATION | 0.99+ |
last week | DATE | 0.99+ |
Erickson | ORGANIZATION | 0.99+ |
Finastra | ORGANIZATION | 0.99+ |
three | QUANTITY | 0.99+ |
thousands | QUANTITY | 0.99+ |
Africa | LOCATION | 0.99+ |
north America | LOCATION | 0.99+ |
telco | ORGANIZATION | 0.99+ |
Silicon valley | LOCATION | 0.99+ |
first | QUANTITY | 0.99+ |
each person | QUANTITY | 0.99+ |
Willie | PERSON | 0.99+ |
10 | QUANTITY | 0.99+ |
44% | QUANTITY | 0.99+ |
over 80% | QUANTITY | 0.99+ |
one | QUANTITY | 0.98+ |
76 million subscribers | QUANTITY | 0.98+ |
60 days ago | DATE | 0.98+ |
over 200 employees | QUANTITY | 0.98+ |
LGU plus | ORGANIZATION | 0.98+ |
Cloudera | TITLE | 0.98+ |
Sacho | TITLE | 0.98+ |
middle east Africa | LOCATION | 0.97+ |
First | QUANTITY | 0.97+ |
Liga data | ORGANIZATION | 0.97+ |
four major vectors | QUANTITY | 0.97+ |
under 40 seconds | QUANTITY | 0.97+ |
YouTube | ORGANIZATION | 0.97+ |
one example | QUANTITY | 0.97+ |
One | QUANTITY | 0.97+ |
two silos | QUANTITY | 0.97+ |
each | QUANTITY | 0.96+ |
Karen | PERSON | 0.96+ |
one case | QUANTITY | 0.96+ |
billions of records | QUANTITY | 0.96+ |
three areas | QUANTITY | 0.96+ |
under a minute | QUANTITY | 0.95+ |
CAFCA | ORGANIZATION | 0.95+ |
one thing | QUANTITY | 0.95+ |
both | QUANTITY | 0.94+ |
12 | QUANTITY | 0.94+ |
LG plus | ORGANIZATION | 0.94+ |
one area | QUANTITY | 0.93+ |
fourth one | QUANTITY | 0.93+ |
hundreds and | QUANTITY | 0.92+ |
a year | QUANTITY | 0.92+ |
Josh Rogers, Syncsort | CUBEConversation, November 2018
>> From the SiliconANGLE media office in Boston, Massachusetts, it's theCUBE. Now, here's your host Stu Miniman. >> Hi, I'm Stu Miniman, and welcome to our Boston area studio. I'm happy to welcome back to the program a multi-time guest, Josh Rogers, who's the CEO of Syncsort. Josh, great to see ya. >> Great to see you. Thanks for having me. >> Alright, so Syncsort is a company that, I would say, you guys are deep in the data ocean. Data is at the center of everything. At Wikibon, when we did our predictions, everything — whether you're talking about cloud, whether you're talking about infrastructure, of course everything like IoT and Edge — it is at the center of it. To help start off, there's this term, big iron, big data. Help explain to us what that is and what that means to both Syncsort and your customers. >> Sure, yeah, so we like to talk about Syncsort as the leader in big iron to big data, and it's a positioning that we've chosen for the firm because we think it represents the value proposition that we bring to our customers, but we also think it represents a collection of use cases that are really at the top of the agenda of CIOs today. And really, we talk about it in two areas. The first is a recognition that large enterprises still run mission critical workloads on systems that they've built over the last 20, 30, 40 years. Those systems leverage mainframe computing, they leverage IBM i or AS/400, and they spent trillions of dollars building those systems, and they still deliver core workloads that power their businesses. So mission number one is that these firms want to make sure that they optimize those environments. They run them as efficiently as possible. They can't go down. They've got the proper security kind of protocols around them, and of course that situation's always changing as workloads grow and change on these environments. So first is: how do I optimize the systems that, while they may be mature, are still mission critical. The second is a recognition that most of the critical data assets for our customers are created in these systems. These are the systems that execute the transactions and, as a result, have core information around the results of the firm, the firm's customers, et cetera. So the second value proposition is: how do I maximize the value of that data that gets produced in those systems, which tends to be a focus on liberating it — making a copy of it and moving it into next generation analytic systems. And when you look at the technical requirements of that, it turns out that it's hard. I'm taking data from systems that were created 50 years ago and integrating it with systems that were created five years ago. And so we've got a special set of expertise and solutions that allow customers to both optimize these old systems and maximize the value of the data produced in those systems. >> You bring up some really good points. I've been talking the last couple of years to people about how do I really wrap my arms around my data, and we're talking about a multi-cloud world where we have pockets of information trapped. That's a challenge. So it's not just about my data center and Amazon.
It's like, oh wait, I've got all these SaaS deployments, and I think it's probably a blind spot that I had had — as in, sure, you've got companies that have, let's call them legacy systems, ones where they've got a lot of investment, but these are mission critical. These are the ones where it is not easy to modernize them, but if I can get access to the data and put it into these next generation systems — it sounds like you kind of free that data and allow it to be leveraged much more easily. >> That's right, that's right, and what we try to do is focus on what the next generation trends in data are and how they're going to intersect with these older systems. And so that started as big data, but it includes cloud and multi-cloud. It includes real-time and IoT. It includes things like Blockchain. We're really scanning the horizon for what these kinds of generational shifts are in terms of how data is going to be leveraged, and we get really tight on the use cases that our customers are going to need, so we can integrate those new technologies with these old investments. >> Josh, I'd love to hear what you're seeing from customers. So we've talked to you at some of the big data shows. I know we've spoken to you at the Splunk shows. I felt like we as an industry got bogged down in some of the tools for a couple of years. At Wikibon, we did the first market forecast on big data, and everybody was like, oh, Hadoop, Hadoop, Hadoop, and we were like, well, Hadoop will catalyze a lot of things and companies will build a lot of things, but Hadoop itself will be a small piece of the market — and we've started to see some consolidation in that market. So data, and the value that I get out of the data, is the important thing. So what are your customers focused on? How do they get from their traditional data warehouses to something more modern? What are the challenges that they're dealing with, and where are you engaging with them? >> Right, sure. So I mean, one of the challenges they do have is this explosion of options. Am I doing things in Hadoop? What is Hadoop at this point? Which projects actually constitute Hadoop? So, what repository am I going to use? Am I going to use Hive? Am I going to use MongoDB, Elastic? What's the repository I'm targeting? Generally what we see is that each of those — and a long list of additional repositories — has a role to play for specific use cases. And then how am I going to get the data there and integrate it, and then get the data out and deliver insights? That stack of technologies and tools is pretty intimidating, and so we see customers starting to coalesce around some market leaders in that space. The merger of Hortonworks and Cloudera, I think, was a very good thing for the industry. It simplifies the life of the customer in terms of making decisions with confidence in that stack. It certainly simplifies our life as a partner of those firms, and I think it will help accelerate maturity in that tech stack. And so I think we're starting to see pockets of maturation, which I think will accelerate customers' investments in leveraging these next generation technologies. That then creates a big opportunity for us, because now it's becoming real.
Now I really have to get, on a real-time basis, my data out of my mainframe or my IBM i system into these next generation repositories, and it turns out that's technically a challenge. So what we're seeing in our business is real acceleration of our big data solutions against what I would call production-targeted workloads and projects, which is great. >> Alright, M&A — you guys are always really active in this space. We've known Syncsort for many years, so we've watched some of the changes along the way. I believe you've got some news to share regarding M&A activity, and there's also some recent stuff from the last year to tap into. Maybe bring us up to speed. >> Sure, so we've made two announcements. We made an announcement in the last few weeks, and then one very recently that I'd like to share. The first is, about two months ago we struck up a development relationship with IBM around their B2B collaboration portfolio, and this product set really gives us exposure to integration styles between businesses. Historically we've been focused on integration within a business, so we really like the exposure to that. More importantly, it intersects with one of these next generational data themes around Blockchain, and we believe there's a huge opportunity to help be a leader in how you take Blockchain infrastructure and integrate it with these existing systems. So we're really excited to partner with IBM on that front. And IBM obviously is making huge investments there. >> Before we go on, what's Syncsort's play there when it comes to Blockchain? We have definitely talked to IBM quite a bit about Blockchain, Hyperledger, everything going on there. So maybe give a little more color there. >> Sure, so look, we still think that production workloads on Blockchain are a few years out, and we see a lot of pilot activity. So I think people are still trying to understand the specific use cases that are going to deliver real value. But one thing is for certain: as customers start to stand up production workloads on the Blockchain, they're going to need to integrate what's happening in that new infrastructure with these traditional systems that are still managing the large majority of their transactions. How do I add data to the Blockchain? How do I verify data on the Blockchain? How do I improve the quality of data on the Blockchain? How do I pull data off of the Blockchain? We think there's a really important role for us to play around understanding the specifics of those use cases, how they intersect with some of these legacy systems, and how we provide tailored solutions that are best in class. And that's one of the primary reasons we've struck up the relationship with IBM, but also joined Hyperledger. So hopefully that gives you a little bit more context. >> That's great. >> The more recent announcement I want to make is that we've acquired a company called Eview, and Eview is a terrific leader in the machine data integration space. They have a number of solutions that are complementary to what we've done with our Ironstream product, and what we're trying to do there is support as many use cases as possible for people to maximize the value that they can get out of machine data, particularly as it relates to older systems like mainframe and IBM i. And what this acquisition does is it allows us to take another step forward in terms of the value proposition that we offer our customers.
One specific use case where Eview's been a leader that we're very excited about is integration with ServiceNow. And you can think of ServiceNow as kind of a next generation platform that we, to date, have not had integration with. This acquisition gives us that integration. It also gives us a set of technology and talent that we can put towards accelerating our overall big data plans. And so we're really excited about having the Eview team join the Syncsort family, and about what we can deliver for customers. >> Yeah, great, great. Absolutely — companies like ServiceNow and Workday, huge amounts of data there, we're seeing a lot of it. Dave Vellante's been at the ServiceNow Knowledge show with theCUBE for a number of years. Really interesting. Seems like this acquisition ties in well with — I believe it was Vision, a year ago? >> Well, so it ties in mostly with our Ironstream product. >> Okay. >> Now, Vision contributed to the Ironstream product in that it gave us the expertise to deliver integration for IBM i log data into next generation analytic platforms like Splunk and Elastic. We had built a product that was focused on delivering mainframe data in real time to those platforms. Vision gave us both real-time capability and a huge franchise in the IBM i space. Eview builds on that and gives us additional capability in terms of delivering data to new repositories like ServiceNow. >> Great, maybe step back for a second. Give us some of the speeds and feeds of Syncsort itself — the momentum of the company. You've been CEO for a while now; tell us how you're doing. >> Yeah, we're doing well. We're having a record year. It's important to recognize that in September we celebrated our 50th anniversary, so I think we're a bit unusual in terms of our heritage. Having said that, we've never driven more innovation than we have over the last 12 months. We have tripled the size of the business over the last three years since I've been CEO. We've quadrupled the employee base. And we will continue to see, I think, rapid growth given the opportunity set we see in this big iron to big data space. >> Yeah, Josh, you talk about that. When I look at, okay, a 50-year-old company — we talked about data quite a bit differently 50 years ago. What is the digital transformation today? What does that mean for Syncsort? What does that mean for your customers? Help put us in context. >> Yeah, I mean, it kind of goes back to this original positioning, which is: the largest banks in the world, the largest telecommunications vendors in the world, healthcare, government — you pick the industry — they built a set of systems that they still run today, built over the last four or five decades. Those systems tend to produce the most important data of that enterprise. Not the only data you want to analyze, but it tends to be that reference data that allows you to make sense of everything else. And as you think about how am I going to analyze that data, how am I going to maximize the value of that data, there is a need to integrate the data and move it off of those platforms and into these next generation platforms. And if you look at the way a VSAM file was designed for the computing requirements of 1970, it turns out it's really different from the way you would design a JSON file or a file for Impala. And so kind of knitting that together takes a lot of deep expertise on both sides of the equation, and we uniquely have that expertise and are solving that.
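To see why that knitting-together is hard, here is a toy illustration of reading a VSAM-style fixed-width record — EBCDIC text plus an IBM packed-decimal field — into JSON. The record layout is invented for the example; this is not Syncsort's implementation.

```python
# A VSAM-style record carries no self-describing schema: the reader must know
# the layout, code page, and field encodings. JSON needs none of that.
import json

def unpack_comp3(b: bytes) -> int:
    """Decode IBM packed decimal: two digits per byte, sign in the last nibble."""
    digits = "".join(f"{byte >> 4}{byte & 0x0F}" for byte in b[:-1])
    digits += str(b[-1] >> 4)
    sign = -1 if (b[-1] & 0x0F) == 0x0D else 1
    return sign * int(digits)

# 4 bytes of EBCDIC text ("ACME") followed by +12345 packed into 3 bytes.
record = bytes([0xC1, 0xC3, 0xD4, 0xC5, 0x12, 0x34, 0x5C])

row = {
    "account": record[:4].decode("cp037"),      # EBCDIC code page 037
    "balance_cents": unpack_comp3(record[4:]),
}
print(json.dumps(row))  # {"account": "ACME", "balance_cents": 12345}
```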
And what we've seen is, as new technologies continue to come to market — which we refer to as the next wave — our enterprise customer base of 7,000 customers needs a partner that can say: how do I take advantage of that new technology trend in the context of the past 30, 40, 50 years of investment I've made in mission critical systems, and how do I support the key integration use cases? And that's where we've determined we can make a difference in the market: focusing on what those use cases are and delivering differentiated solutions to solve them that help both our customers and these partners. >> Absolutely. It's always great to talk about some of the new stuff, but you need to meet the customers where they are, get to that data where it is, and help move it forward. Alright, Josh, why don't you give us the final word? Kind of broadly open: big challenges, opportunities — what's exciting you as you look forward to the next six months? >> Yeah, so we'll continue to make investments in cloud, in data governance, in supporting real-time data streaming, and in security. Those are the areas where we'll be focused on driving innovation and delivering additional capability to our customers. Some of that will come through taking technologies like Eview or the B2B products and enhancing them for specific use cases where they intersect those things. There will also be additional investments from an acquisition perspective in those domains, and you can count on Syncsort to continue to expand the value proposition it delivers to its customers, both through new technology introductions and through additional integration with these next generation platforms. So we're really excited. We believe our strategy is working — it's led to record results in our 50th year — and we think we've got many years to run with this strategy. >> Alright, well, Josh Rogers, CEO of Syncsort: congratulations on the progress, the new acquisition, and the deeper partnership with IBM, and I look forward to tracking the updates. >> Thanks so much. Appreciate the opportunity. >> Alright, and thank you as always for joining. I'm Stu Miniman. Thanks for watching theCUBE. (upbeat electronic music)
SUMMARY :
From the SiliconANGLE media office in Boston, Stu Miniman talks with Syncsort CEO Josh Rogers about the company's big iron to big data positioning: optimizing the mainframe and IBM i systems that still run mission critical workloads, and maximizing the value of the data those systems produce by integrating it with next generation platforms. They cover the challenges customers face in the big data tool landscape, the new development relationship with IBM around its B2B collaboration portfolio and how that intersects with Blockchain and the systems still managing the large majority of transactions, the Eview acquisition that is complementary to Ironstream and adds ServiceNow integration on top of the real-time capability and IBM i franchise that Vision brought, the momentum of the company in its record 50th year, and continued investment in cloud, data governance, real-time data streaming, and security — a strategy Rogers believes has many years to run.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Josh | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Dave Alonte | PERSON | 0.99+ |
Josh Rogers | PERSON | 0.99+ |
Syncsort | ORGANIZATION | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
Boston | LOCATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Stu Miniman | PERSON | 0.99+ |
September | DATE | 0.99+ |
1970 | DATE | 0.99+ |
7,000 customers | QUANTITY | 0.99+ |
November 2018 | DATE | 0.99+ |
50th year | QUANTITY | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
two areas | QUANTITY | 0.99+ |
last year | DATE | 0.99+ |
second | QUANTITY | 0.99+ |
Wikibon | ORGANIZATION | 0.99+ |
first | QUANTITY | 0.99+ |
Boston, Massachusetts | LOCATION | 0.99+ |
Eview | ORGANIZATION | 0.99+ |
Hadoop | TITLE | 0.99+ |
each | QUANTITY | 0.99+ |
a year ago | DATE | 0.99+ |
one | QUANTITY | 0.99+ |
Vision | ORGANIZATION | 0.98+ |
both | QUANTITY | 0.98+ |
both sides | QUANTITY | 0.98+ |
50th anniversary | QUANTITY | 0.98+ |
Hyperledger | ORGANIZATION | 0.98+ |
trillions of dollars | QUANTITY | 0.98+ |
five years ago | DATE | 0.98+ |
two announcements | QUANTITY | 0.98+ |
MongoDB | TITLE | 0.98+ |
Edge | ORGANIZATION | 0.97+ |
50 years ago | DATE | 0.97+ |
30 | QUANTITY | 0.96+ |
today | DATE | 0.96+ |
JSON | TITLE | 0.95+ |
ServiceNow | TITLE | 0.94+ |
50-year-old | QUANTITY | 0.93+ |
Elastic | TITLE | 0.92+ |
Evue | ORGANIZATION | 0.9+ |
40 years | QUANTITY | 0.9+ |
M&A | ORGANIZATION | 0.9+ |
AS400 | COMMERCIAL_ITEM | 0.89+ |
next six months | DATE | 0.88+ |
one thing | QUANTITY | 0.87+ |
Splunk | TITLE | 0.87+ |
40, 50 years | QUANTITY | 0.86+ |
last 12 months | DATE | 0.86+ |
ThinkSort | ORGANIZATION | 0.84+ |
Hive | TITLE | 0.84+ |
Brent Compton, Red Hat | theCUBE NYC 2018
>> Live from New York, it's theCUBE, covering theCUBE New York City 2018. Brought to you by SiliconANGLE Media and its ecosystem partners. >> Hello, everyone, welcome back. This is theCUBE live in New York City for theCUBE NYC, #CUBENYC. This is our ninth year covering the big data ecosystem, which has now merged into cloud. All things coming together — it's really about AI, it's about developers, it's about operations, it's about data scientists. I'm John Furrier, with my co-host Dave Vellante. Our next guest is Brent Compton, Technical Marketing Director for the Storage Business at Red Hat. As you know, we cover Red Hat Summit, and it's great to have the conversation. Open source, DevOps is the theme here. Brent, thanks for joining us, thanks for coming on. >> My pleasure, thank you. >> We've been talking about the role of AI, and AI needs data, and data needs storage, which is what you do. But if you look at what's going on in the marketplace, there's kind of an architectural shift. It's harder to find a cloud architect than it is to find diamonds these days; you can't find a good cloud architect. Cloud is driving a lot of the action, and data is a big part of that. What's Red Hat doing in this area, and what's emerging for you guys in this data landscape? >> Really, the days of specialists are over. You mentioned it's more difficult to find a cloud architect than to find diamonds. What we see is that infrastructure is becoming less about compute, storage, and networking as separate specialties — it's the architect who can bring the confluence of those specialties together. One of the things that we see is people bringing their analytics workloads onto the common platforms where they've been running the rest of their enterprise applications. For instance, if they're running a lot of their enterprise applications on AWS, of course they want to run their analytics workloads in AWS — that's EMR, long since in the history books. Likewise, if they're running a lot of their enterprise applications on OpenStack, it's natural that they want to run a lot of their analytics workloads on the same type of dynamically provisioned infrastructure. And emerging, of course — as we just announced on Monday of this week with Hortonworks and IBM — if they're running a lot of their enterprise applications on a Kubernetes substrate like OpenShift, they want to run their analytics workloads on that same kind of agile infrastructure. >> Talk about the private cloud impact and hybrid cloud, because obviously we just talked to the CEO of Hortonworks. Normally it's about early days, about Hadoop, data lakes, and then data planes. They had a good vision. They're years into it, but I like what Hortonworks is doing. But he said Kubernetes — on a data show, Kubernetes. Kubernetes is a multi-cloud, hybrid cloud concept, containers. This is really enabling a lot of value, and you guys have OpenShift, which became very successful over the past few years — the growth has been phenomenal. So congratulations, but it's pointing to a bigger trend, and that is that the infrastructure software, the platform as a service, is becoming the middleware, the glue, if you will, and Kubernetes and containers are facilitating a new architecture for developers and operators. How important is that for you guys, and what's the impact on the customer when they think, okay, I'm going to have an agile DevOps environment, workload portability — but do I have to build that out? You mentioned people don't have to necessarily do that anymore. The trend has become on-premises.
What's the impact on the customer as they hear Kubernetes and containers in the data conversation? >> You mentioned agile DevOps environment, workload portability. One of the things that customers come to us for is having that same thing, but infrastructure agnostic. They say, I don't want to be locked in. Love AWS, love Azure, but I don't want to be locked into those platforms. I want an abstraction layer — my Kubernetes layer — that sits on top of those infrastructure platforms. As I bring my workloads, one by one, custom DevOps, from a lift and shift of legacy apps onto that substrate, I want it to be independent — private cloud or public cloud. And, time permitting, we'll go into more detail about what we've seen happening in the private cloud with analytics as well, which is effectively what brought us here today. The pattern we've discovered with a lot of our large customers running OpenStack — large institutions that, for lots of reasons, store a lot of their data on-premises — is that they say: we want to use the utility compute model that OpenStack gives us, as well as the shared data context that Ceph gives us, and we want to use that same thing for our analytics workloads. So effectively, some of our large customers taught us this program. >> So they're building infrastructure for analytics, essentially. >> That's what it is. >> One of the challenges with that is the data is everywhere. It's all in silos, it's locked in some server somewhere. First of all, am I overstating that problem, and how are you seeing customers deal with it? What are some of the challenges that they're having, and how are you guys helping? >> Perfect lead-in. In fact, one of our large government customers recently sent us an unsolicited email after they deployed the first 10 petabytes in a deca-petabyte solution. It's OpenStack based as well as Ceph based. Three taglines in their email: the first was releasing the lock on data. The second was releasing the lock on compute. And the third was releasing the lock on innovation. Now, that sounds a bit buzzword-y, but when it comes from a customer to you... >> That came from a customer? Sounds like a marketing department wrote that. >> In the details, as you know, with traditional HDFS clusters — traditional Hadoop clusters, Spark clusters, or whatever — HDFS is not shared between clusters. One of our large customers has 50-plus analytics clusters. Their data platforms team employs a maze of scripts to copy data from one cluster to the other. And if you are a scientist or an engineer, you'd say, I'm trying to obtain these types of answers, but I need access to data sets A, B, C, and D — and data sets A and B are only on this cluster. I've got to go contact the data platforms team and have them copy it over and ensure that it's up to date and in sync. So it's messy. >> It's a nightmare. >> Messy. So that's why the one customer said releasing the lock on data: because now it's in a shared context. It's a similar paradigm to AWS with EMR — the data's in a shared context, in S3, and you spin up your analytics workloads on EC2. Same paradigm discussion with OpenStack: you're spinning up your analytics workloads via OpenStack virtualization, and they're sourcing a shared data context inside of Ceph — S3-compatible Ceph — so, same architecture. I love his last bit, the one that sounds the most buzzword-y, which was releasing the lock on innovation. And English was not this person's first language, so I love the wording.
He said, our developers no longer fear experimentation, because it's so easy. In minutes they can spin up an analytics cluster with a shared data context; if they get the wrong mix of things, they shut it down and spin it up again. >> In the previous example you used HDFS clusters — there are so many trip wires, right? You can break something. >> It's fragile. >> It's like scripts. You don't want to tinker with that. Developers don't want to get their hands slapped. >> The other thing is also the recognition that innovation comes from data. That's my takeaway: the customer is saying, okay, now we can innovate because we have access to the data, and we can apply intelligence to that data, whether it's machine intelligence or analytics, et cetera. >> This is the trend in infrastructure. You mentioned the shared context. What other observations and learnings have you guys come to as Red Hat starts to get more customer interactions around analytical infrastructure? Is it an IT problem? You mentioned abstracting away different infrastructures, and that means multi-cloud is probably set up for you guys in a big way. But what does that mean for a customer? If you had to explain infrastructure for analytics — what needs to get done, what does the customer need to do — how do you describe that? >> I love the term the industry uses: multi-tenant workload isolation with shared data context. That's such a concise term to describe what we talk to our customers about, and for most of them, that's what they're looking for. They've got data scientist teams that don't want their workloads mixed in with the long-running batch workloads. They say, listen, I'm on deadline here. I've got an hour to get these answers. They're working with Impala, they're working with Presto, they iterate — they don't know exactly the pattern they're looking for — so it can't take a long time because their jobs are mixed in with these long MapReduce jobs. They need to be able to spin up infrastructure: workload isolation, meaning they have their own space; shared data context, meaning they don't have to place calls over to the platform team saying, I need data sets C, D, and E, could you please send them over? I'm on deadline here. That phrase, I think, captures so nicely what customers are really looking to do with their analytics infrastructure. The analytics tools will still do their thing, but the infrastructure underneath the analytics is delivering this new type of agility: that multi-tenant workload isolation with shared data context. >> You know what's funny is, we were talking at the kickoff — we were looking back nine years. We've been at this event for nine years now. We made the prediction that there would be no Red Hat of big data. John, years ago, said, unless it's Red Hat. You guys got dragged into this by your customers, really, is how it came about. >> Customers and partners — of course, with your recent guest from Hortonworks, the announcement that Red Hat, Hortonworks, and IBM had on Monday of this week, dialing up that agility even further. OpenStack is great for agility — private cloud, utility based computing and storage with OpenStack and Ceph, great. OpenShift dials up that agility another notch, and of course we heard from the CEO of Hortonworks how much they love the agility that a Kubernetes based substrate provides their analytics customers. >> That's essentially how you're creating that sort of same-same experience between on-prem and multi-cloud, is that right?
>> Yeah, OpenShift is deployed pervasively on AWS, on-premises, on Azure, on GCE. >> It's a multi-cloud world, we see that for sure. Again, the validation was at VMworld. AWS CEO, Andy Jassy announced RDS which is their product on VMware on-premises which they've never done. Amazon's never done any product on-premises. We were speculating it would be a hardware device. We missed that one, but it's a software. But this is the validation, seamless cloud operations on-premise in the cloud really is what people want. They want one standard operating model and they want to abstract away the infrastructure, as you were saying, as the big trend. The question that we have is, okay, go to the next level. From a developer standpoint, what is this modern developer using for tools in the infrastructure? How can they get that agility and spinning up isolated, multi-tenant infrastructure concept all the time? This is the demand we're seeing, that's an evolution. Question for Red Hat is, how does that change your partnership strategy because you mentioned Rob Bearden. They've been hardcore enterprise and you guys are hardcore enterprise. You kind of know the little things that customers want that might not be obvious to people: compliance, certification, a decade of support. How is Red Hat's partnership model changing with this changing landscape, if you will? You mentioned IBM and Hortonworks release this week, but what in general, how does the partnership strategy look for you? >> The more it changes, the more it looks the same. When you go back 20 years ago, what Red Hat has always stood for is any application on any infrastructure. But back in the day it was we had n-thousand of applications that were certified on Red Hat Linux and we ran on anybody's server. >> Box. >> Running on a box, exactly. It's a similar play, just in 2018 in the world of hybrid, multi-cloud architectures. >> Well, you guys have done some serious heavy lifting. Don't hate me for saying this, but you're kind of like the mules of the industry. You do a lot of stuff that nobody either wants to do or knows how to do and it's really paid off. You just look at the ascendancy of the company, it's been amazing. >> Well, multi-cloud is hard. Look at what it takes to do multi-cloud in DevOps. It's not easy and a lot of pretenders will fall out of the way, you guys have done well. What's next for you guys? What's on the horizon? What's happening for you guys this next couple months for Red Hat and technology? Any new announcements coming? What's the vision, what's happening? >> One of the announcements that you saw last week, was Red Hat, Cloudera, and Eurotech as analytics in the data center is great. Increasingly, the world's businesses run on data-driven decisions. That's great, but analytics at the edge for more realtime industrial automation, et cetera. Per the announcements we did with Cloudera and Eurotech about the use of, we haven't even talked about Red Hat's middleware platforms, such as AMQ Streams now based on Kafka, a Kafka distribution, Fuze, an integration master effectively bringing Red Hat technology to the edge of analytics so that you have the ability to do some processing in realtime before back calling all the way back to the data center. That's an area that you'll also see is pushing some analytics to the edge through our partnerships such as announced with Cloudera and Eurotech. >> You guys got the Red Hat Summit coming up next year. theCUBE will be there, as usual. It's great to cover Red Hat. 
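A minimal sketch of the shared-data-context pattern Brent describes, assuming a Ceph RADOS Gateway exposing an S3 interface and a Spark build with the Hadoop S3A connector on the classpath; the endpoint, bucket, and credentials below are placeholders, not a real deployment.

```python
# Any number of short-lived, isolated analytics clusters can point at the same
# S3-compatible Ceph object store -- no cross-cluster HDFS copy scripts needed.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shared-data-context")
         .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw.example.local:8080")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Read a shared dataset directly from the object store and run a quick query.
df = spark.read.parquet("s3a://analytics/datasets/shared/")  # illustrative bucket
df.groupBy("region").count().show()
```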
Thanks for coming on theCUBE, Brent. Appreciate it, thanks for spending the time. We're here in New York City live. I'm John Furrier, Dave Vallante, stay with us. All day coverage today and tomorrow in New York City. We'll be right back. (upbeat music)
SUMMARY :
From theCUBE NYC, brought to you by SiliconANGLE Media, John Furrier and Dave Vellante talk with Brent Compton of Red Hat about analytics infrastructure: customers bringing analytics workloads onto the common platforms where they run the rest of their enterprise applications — OpenStack with Ceph for utility computing and a shared data context, and now Kubernetes substrates like OpenShift, per the Monday announcement with Hortonworks and IBM. A large government customer with a deca-petabyte deployment described the result as releasing the lock on data, on compute, and on innovation: instead of a maze of scripts copying data between 50-plus siloed HDFS clusters, ephemeral analytics clusters share an S3-compatible data context in Ceph, so developers no longer fear experimentation. The conversation covers multi-tenant workload isolation with shared data context, OpenShift deployed pervasively across AWS, Azure, GCE, and on-premises, Red Hat's any-application-on-any-infrastructure partnership model, and last week's announcement with Cloudera and Eurotech pushing analytics to the edge with AMQ Streams and Fuse.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Vallante | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
John | PERSON | 0.99+ |
Brent Compton | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
Eurotech | ORGANIZATION | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Brent | PERSON | 0.99+ |
New York City | LOCATION | 0.99+ |
2018 | DATE | 0.99+ |
Red Hat | ORGANIZATION | 0.99+ |
Rob Bearden | PERSON | 0.99+ |
nine years | QUANTITY | 0.99+ |
Andy Jassy | PERSON | 0.99+ |
last week | DATE | 0.99+ |
first language | QUANTITY | 0.99+ |
Three taglines | QUANTITY | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
first | QUANTITY | 0.99+ |
tomorrow | DATE | 0.99+ |
second | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
next year | DATE | 0.99+ |
third | QUANTITY | 0.99+ |
New York | LOCATION | 0.99+ |
Impala | ORGANIZATION | 0.99+ |
Monday this week | DATE | 0.99+ |
VMworld | ORGANIZATION | 0.98+ |
one cluster | QUANTITY | 0.98+ |
Red Hat Summit | EVENT | 0.98+ |
ninth year | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
OpenStack | TITLE | 0.98+ |
today | DATE | 0.98+ |
NYC | LOCATION | 0.97+ |
20 years ago | DATE | 0.97+ |
Kubernetese | TITLE | 0.97+ |
Kafka | TITLE | 0.97+ |
First | QUANTITY | 0.96+ |
this week | DATE | 0.96+ |
Red Hat | TITLE | 0.95+ |
English | OTHER | 0.95+ |
Monday of this week | DATE | 0.94+ |
OpenShift | TITLE | 0.94+ |
one standard | QUANTITY | 0.94+ |
50 plus analytics clusters | QUANTITY | 0.93+ |
Ceph | TITLE | 0.92+ |
Azure | TITLE | 0.92+ |
GCE | TITLE | 0.9+ |
Presto | ORGANIZATION | 0.9+ |
agile DevOps | TITLE | 0.89+ |
theCUBE | ORGANIZATION | 0.88+ |
DevOps | TITLE | 0.87+ |
Kickoff | theCUBE NYC 2018
>> Live from New York, it's theCUBE, covering theCUBE New York City 2018. Brought to you by SiliconANGLE Media and its ecosystem partners. (techy music) >> Hello, everyone, welcome to this CUBE special presentation here in New York City for CUBENYC. I'm John Furrier with Dave Vellante. This is our ninth year covering the big data industry, starting with Hadoop World and evolving over the years. This is our ninth year, Dave. We've been covering Hadoop World, Hadoop Summit, Strata Conference, Strata Hadoop. Now it's called Strata Data — I don't know what O'Reilly's going to call it next. As you all know, theCUBE has been present for the creation of the Hadoop big data ecosystem. We're here for our ninth year; certainly a lot's changed. AI's the center of the conversation, and certainly we've seen some horses come in, some haven't come in, and trends have emerged, some gone away. Your thoughts — nine years covering big data. >> Well, John, I remember fondly, vividly, the call that I got. I was in Dallas at a storage networking world show, and you called and said, "Hey, we're doing Hadoop World, get over there," and of course Hadoop, big data, was the new, hot thing. I told everybody, "I'm leaving." Most of the people said, "What's Hadoop?" Right, so we came, we started covering — it was people like Jeff Hammerbacher, Amr Awadallah, Doug Cutting, who invented Hadoop, Mike Olson, you know, head of Cloudera at the time, and people like Abhi Mehta, who at the time was at B of A — and some of the things we learned then were profound-- >> Yeah. >> As much as Hadoop is sort of on the back burner now and people really aren't talking about it, some of the things that are profound about Hadoop really were the ideas: the notion of bringing five megabytes of code to a petabyte of data, for example, or the notion of no schema on write — you know, put it into the database and then figure it out. >> Unstructured data. >> Right. >> Object storage. >> And so that created a state of innovation, of funding. We were talking last night about how, many years ago at this event, this time of year, concurrent with Strata, you would have VCs all over the place. There really aren't a lot of VCs here this year, not a lot of VC parties-- >> Mm-hm. >> As there used to be, so that's somewhat waned. But some of the things we talked about back then: we said that the big money in big data was going to be made by the practitioners, not by the vendors, and that's proved true. I mean... >> Yeah. >> The big three Hadoop distro vendors — Cloudera, Hortonworks, and MapR — you know, Cloudera's $2.5 billion valuation, not bad, but it's not a $30, $40 billion value company. The other thing we said is there would be no Red Hat of big data. You said, "Well, the only Red Hat of big data might be Red Hat," and so, (chuckles) that's basically proved true. >> Yeah. >> And so, I think if we look back, we always talked about Hadoop and big data being a reduction — the ROI was a reduction on investment. >> Yeah. >> It was a way to have a cheaper data warehouse, and that's essentially-- Well, what did we get right and wrong? I mean, let's look at some of the trends. First of all, I think we got pretty much everything right, as you know. We tend to make the calls pretty accurately with theCUBE. We've got a lot of data — we have the analytics in our own system, plus we have the research team digging in — so, you know, we pretty much do a good job.
I think one thing that we predicted was that Hadoop certainly would change the game, and it did. We also predicted that there wouldn't be a Red Hat for Hadoop, and that prediction held. The other prediction was that Hadoop wouldn't kill data warehouses, it didn't, and then data lakes came along. You know my position on data lakes. >> Yeah. >> I've always hated the term. I always liked data ocean because I think it conveyed much more fluidity of the data, so I think we got that one right, and data lakes still don't look like they're going to pan out well. I mean, for most people that deploy data lakes, it's really either not a core thing or part of something else, and it's turning into a data swamp, so I think the data lake piece is not panning out the way people thought it would. I think one thing we did get right, also, is that data would be the center of the value proposition, and it continues and remains to be, and I think we're seeing that now, and we said data's the development kit back in 2010 when we said data's going to be part of programming. >> Some of the other things: with our early data, we went out and talked to a lot of practitioners, who were hard to find in the early days. They were just a select few, I mean, other than inside of Google and Yahoo! But what they told us is that things like SQL and the enterprise data warehouse were key components of their big data strategy, so to your point, you know, it wasn't going to kill the EDW, but it was going to surround it. The other thing we called was cloud. Four years ago our data showed clearly that much of this work, the modeling, the big data wrangling, et cetera, was being done in the cloud, and Cloudera, Hortonworks, and MapR, none of them at the time really had a cloud strategy. Today that's all they're talking about: cloud and hybrid cloud. >> Well, it's interesting, I think it was like four years ago, I think, Dave, when we actually were riffing on the notion of, you know, Cloudera's name. It's called Cloudera, you know. If you spell it out, in Cloudera we're in a cloud era, and I think we were very aggressive at that point. I think Amr Awadallah even made a comment on Twitter. He was like, "I don't understand "where you guys are coming from." We were actually saying at the time that Cloudera should actually leverage more cloud at that time, and they didn't. They stayed on their IPO track and they had to because they had everything bet on Impala and this data model that they had and being the business model, and then they went public, but I think clearly cloud is now part of Cloudera's story, and I think that's a good call, and it's not too late for them. It never was too late, but you know, Cloudera has executed. I mean, if you look at what's happened with Cloudera, they were the only game in town. When we started theCUBE we were in their office, as most people know in this industry, that we were there with Cloudera when they had like 17 employees. I thought Cloudera was going to run the table, but then what happened was Hortonworks came out of Yahoo! That, I think, changed the game, and I think that competitive battle between Hortonworks and Cloudera, in my opinion, changed the industry, because if Hortonworks did not come out of Yahoo! Cloudera would've had an uncontested run. I think the landscape of the ecosystem would look completely different had Hortonworks not competed, because you think about, Dave, they had that competitive battle for years.
The Hortonworks-Cloudera battle, I think it changed the industry, and I think it could've been a different outcome. If Hortonworks wasn't there, I think Cloudera probably would've taken Hadoop and made it so much more, and I think they would've gotten more done. >> Yeah, and I think the other point we have to make here is complexity really hurt the Hadoop ecosystem, and it was just bespoke, new projects coming out all the time, and you had Cloudera, Hortonworks, and maybe to a lesser extent MapR, doing a lot of the heavy lifting, particularly, you know, Hortonworks and Cloudera. They had to invest a lot of their R&D in making these systems work and integrating them, and you know, complexity just really broke the back of the Hadoop ecosystem, and so then Spark came in, everybody said, "Oh, Spark's going to basically replace Hadoop." You know, yes and no, the people who got Hadoop right, you know, embraced it and they still use it. Spark definitely simplified things, but now the conversation has turned to AI, John. So, I got to ask you, I'm going to use your line on you in kind of the ask-me-anything segment here. AI, is it same wine, new bottle, or is it really substantively different in your opinion? >> I think it's substantively different. I don't think it's the same wine in a new bottle. I'll tell you... Well, it's kind of, it's like the bad wine... (laughs) Is going to be kind of blended in with the good wine, which is now AI. If you look at this industry, the big data industry, if you look at what O'Reilly did with this conference. I think O'Reilly really has not done a good job with the big data conference. I think they blew it, I think that they made it a, you know, monetized, closed system when the big data business could've been all about AI in a much deeper way. I think AI is subordinate to cloud, and you mentioned cloud earlier. If you look at all the action within the AI segment, Diane Greene talking about it at Google Next, Amazon, AI is a software layer substrate that will be underpinned by the cloud. Cloud will drive more action, you need more compute, that drives more data, more data drives the machine learning, machine learning drives the AI, so I think AI is always going to be dependent upon clouds or some sort of high-compute resource base, and all the cloud analytics are feeding into these AI models, so I think cloud takes over AI, no doubt, and I think this whole ecosystem of big data gets subsumed under either an AWS, VMworld, Google, and Microsoft Cloud show, and then also I think specialization around data science is going to go off on its own. So, I think you're going to see the breakup of the big data industry as we know it today. Strata Hadoop, Strata Data Conference, that thing's going to crumble into multiple, fractured ecosystems. >> It's already starting to be forked. I think the other thing I want to say about Hadoop is that it actually brought such great awareness to the notion of data, putting data at the core of your company, data and data value, the ability to understand how data at least contributes to the monetization of your company. AI would not be possible without the data. Right, and we've talked about this before. You call it the innovation sandwich. The innovation sandwich for the last three decades has been Moore's law. The innovation sandwich going forward is data, machine intelligence applied to that data, and cloud for scale, and that's the sandwich of innovation over the next 10 to 20 years.
>> Yeah, and I think data is everywhere, so this idea of it being a categorical industry segment is a little bit off. I mean, although I know data warehouse is kind of its own category and you're seeing that, I don't think it's like a Magic Quadrant anymore. Every quadrant has data. >> Mm-hm. >> So, I think data's fundamental, and I think that's why it's going to become a layer within a control plane of either cloud or some other system, I think. I think that's pretty clear, there's no, like, one. You can't buy big data, you can't buy AI. I think you can have AI, you know, things like TensorFlow, but it's going to be a completely... Every layer of the stack is going to be impacted by AI and data. >> And I think the big players are going to infuse their applications and their databases with machine intelligence. You're going to see this, you're certainly, you know, seeing it with IBM, the sort of Watson heavy lift. Clearly Google, Amazon, you know, Facebook, Alibaba, and Microsoft, they're infusing AI throughout their entire set of cloud services and applications and infrastructure, and I think that's good news for the practitioners. People aren't... Most companies aren't going to build their own AI, they're going to buy AI, and that's how they close the gap between the sort of data haves and the data have-nots, and again, I want to emphasize that the fundamental difference, to me anyway, is having data at the core. If you look at the top five companies in terms of market value, US companies, Facebook maybe not so much anymore because of the fake news, though Facebook will be back with its two billion users, but Apple, Google, Facebook, Amazon, who am I... And Microsoft, those five have put data at the core and they're the most valuable companies in the stock market from a market cap standpoint, why? Because it's a recognition that that intangible value of the data is actually quite valuable, and even though banks and financial institutions are data companies, their data lives in silos. So, these five have put data at the center, surrounded it with human expertise, as opposed to having humans at the center and having data all over the place. So, how do they, how do these companies close the gap? How do the companies in the flyover states close the gap? The way they close the gap, in my view, is they buy technologies that have AI infused in them, and I think the last thing I'll say is I see cloud as the substrate, and AI, and blockchain and other services, as the automation layer on top of it. I think that's going to be the big tailwind for innovation over the next decade. >> Yeah, and obviously the theme of machine learning drives a lot of the conversations here, and that's essentially never going to go away. Machine learning is the core of AI, and I would argue that AI truly doesn't even exist yet. It's machine learning really driving the value, but to put a validation on the fact that cloud is going to be driving AI business, some of the terms in popular conversations we're hearing here in New York around this event and topic, CUBENYC and Strata Conference, are Kubernetes and blockchain, and you know, these automation, AI operation kind of conversations. That's an IT conversation, (chuckles) so you know, that's interesting. You've got IT, really, with storage.
You've got to store the data, so you can't not talk about workloads and how the data moves with workloads, so you're starting to see data and workloads kind of be tossed in the same conversation, and that's a cloud conversation. That is all about multi-cloud. That's why you're seeing Kubernetes, a term I never thought I would be saying at a big data show, but Kubernetes is going to be key for moving workloads around, of which there's data involved. (chuckles) Instrumenting the workloads, data inside the workloads, data driving data. This is where AI and machine learning's going to play, so again, cloud subsumes AI, that's the story, and I think that's going to be the big trend. >> Well, and I think you're right. I mean, that's why you're hearing the messaging of hybrid cloud from the big distro vendors, and the other thing is you're hearing from a lot of the NoSQL database guys, they're bringing ACID compliance, they're bringing enterprise-grade capability, so you're seeing the world is hybrid. You're seeing those two worlds come together, so... >> Their worlds are converging, and the playing field is getting leveled out there. It's all about enterprise, B2B, AI, cloud, and data. That's theCUBE bringing you the data here. New York City, CUBENYC, that's the hashtag. Stay with us for more coverage live in New York after this short break. (techy music)
ENTITIES
Entity | Category | Confidence |
---|---|---|
Apple | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Diane Greene | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Alibaba | ORGANIZATION | 0.99+ |
Dave | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Jeff Hammerbacher | PERSON | 0.99+ |
$30 | QUANTITY | 0.99+ |
New York | LOCATION | 0.99+ |
2010 | DATE | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Doug Cutting | PERSON | 0.99+ |
Mike Olson | PERSON | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
Dallas | LOCATION | 0.99+ |
O'Reilly | ORGANIZATION | 0.99+ |
Yahoo | ORGANIZATION | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
five | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Abi Mehda | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
New York City | LOCATION | 0.99+ |
$2.5 billion | QUANTITY | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
MapR | ORGANIZATION | 0.99+ |
Amr Awadallah | PERSON | 0.99+ |
$40 billion | QUANTITY | 0.99+ |
17 employees | QUANTITY | 0.99+ |
VMworld | ORGANIZATION | 0.99+ |
Today | DATE | 0.99+ |
Impala | ORGANIZATION | 0.99+ |
Nine years | QUANTITY | 0.99+ |
four years ago | DATE | 0.98+ |
last night | DATE | 0.98+ |
last decade | DATE | 0.98+ |
Strata Data Conference | EVENT | 0.98+ |
Strata Conference | EVENT | 0.98+ |
Hadoop Summit | EVENT | 0.98+ |
ninth year | QUANTITY | 0.98+ |
Four years ago | DATE | 0.98+ |
two worlds | QUANTITY | 0.97+ |
five companies | QUANTITY | 0.97+ |
today | DATE | 0.97+ |
Strata Hadoop | EVENT | 0.97+ |
Hadoop World | EVENT | 0.96+ |
CUBE | ORGANIZATION | 0.96+ |
Google Next | ORGANIZATION | 0.95+ |
this year | DATE | 0.95+ |
Spark | ORGANIZATION | 0.95+ |
US | LOCATION | 0.94+ |
CUBENYC | EVENT | 0.94+ |
Strata O'Reilly | ORGANIZATION | 0.93+ |
next decade | DATE | 0.93+ |
Matthew Baird, AtScale | Big Data SV 2018
>> Announcer: Live from San Jose. It's theCUBE, presenting Big Data Silicon Valley. Brought to you by SiliconANGLE Media and its ecosystem partners. (techno music) >> Welcome back to theCUBE, our continuing coverage on day one of our event, Big Data SV. I'm Lisa Martin with George Gilbert. We are down the street from the Strata Data Conference. We've got a lot of cool stuff going on. You can see the cool set behind me. We are at Forager Tasting Room & Eatery. Come down and join us, be in our audience today. We have a cocktail event tonight, who doesn't want to join that? And we have a nice presentation tomorrow morning of our Wikibon 2018 Big Data Forecast and Review. Joining us next is Matthew Baird, the co-founder of AtScale. Matthew, welcome to theCUBE. >> Thanks for having me. Fantastic venue, by the way. >> Isn't it cool? >> This is very cool. >> Yeah, it is. So, talking about big data, you know, Gartner says, "85% of big data projects have failed." I often say failure is not a bad F word, because it can spawn the genesis of a lot of great business opportunities. Data lakes were big a few years ago, turned into swamps. AtScale has this vision of Data Lake 2.0, what is that? >> So, you're right. There have been a lot of failures, there's no doubt about it. And you're also right that is how we evolve, and we're a Silicon Valley based company. We don't give up when faced with these things. It's just another way to not do something. So, what we've seen and what we've learned through our customers is they need to have a solution that is integrated with all the technologies that they've adopted in the enterprise. And it's really about, if you're going to make a data lake, you're going to have data on there that is the crown jewels of your business. How are you going to get that in the hands of your constituents, so that they can analyze it, and they can use it to make decisions? And how can we, furthermore, do that in a way that supplies governance and auditability on top of it, so that we aren't just sending data out into the ether and not knowing where it goes? We have a lot of customers in the insurance, health insurance space, and with financial customers, where the data absolutely must be managed. I think one of the biggest changes is around that integration with the current technologies. There's a lot of movement into the cloud. The new data lake is kind of focused more on these large data stores, where it was HDFS with Hadoop. Now it's S3, Google's object storage, and Azure ADLS. Those are the sorts of things that are backing the new data lake, I believe. >> So if we take these, the data lake store didn't have to be something that's an open source HDFS implementation, it could even be accessed just through an HDFS API. >> Matthew: Yeah, absolutely. >> What are some of the, how should we think about the data sources and feeds for this repository, and then what is it on top that we need to put to make the data more consumable? >> Yeah, that's a good point. S3, Google object storage, and Azure, they all have a characteristic of being large stores. You can store as much as you want. On the clouds, and in open source for on-prem, the software for streaming the data and landing it exists, but the important thing there is it's cost-effective. S3 is a cost-effective storage system. HDFS is a mostly cost-effective storage system.
You have to manage it, so it has a slightly higher cost, but the advice has been: get the data to the place you're going to store it, and store it in a unified format. You get a halo effect when you have a unified format, and I think the industry is coalescing around... I'd probably say Parquet's in the lead right now, but once Parquet can be read by, let's take Amazon for instance, can be read by Athena, can be read by Redshift Spectrum, can be read by their EMR, now you have this halo effect where your data's always there, always available to be consumed by a tool or a technology that can then deliver it to your end users. >> So when we talk about Parquet, we're talking about a columnar serialization format, >> Matthew: Yes. >> but there's more on top of that that needs to be layered, so that you can, as we were talking about earlier, combine the experience of a data warehouse, and the curated... >> Absolutely. >> ...data access where there's guard rails, >> Matthew: Yes. >> and it's simple, versus sort of the wild west, where I capture everything in a data lake. How do you bring those two together? >> Well, specifically for AtScale, we allow you to integrate multiple data access tools in AtScale, and then we use the appropriate tool to access the data for the use case. So let me give you an example: in the Amazon case, Redshift is wonderful for accessing interactive data, which BI users want, right? They want fast queries, sub-second queries. They don't want to pay to have all the raw data necessarily stored in Redshift 'cause that's pretty expensive. So they have Redshift Spectrum; the data's sitting in S3, and that's cost effective. So when we go and we read raw data to build these summary tables, to deliver the data fast, we can read from Spectrum, we can put it all together, drop it into Redshift, a much smaller volume of data, so it has faster characteristics for being accessed. And it delivers it to the user that way. We do that in Hadoop when we access via Hive for building aggregate tables, but Spark or Impala is a much faster interactive engine, so we use those. As I step back and look at this, I think the Data Lake 2.0, from a technical perspective, is about abstraction, and abstraction's sort of what separates us from the animals, right? It's a concept where we can pack a lot of sophistication and complexity behind an interface that allows people to just do what they want to do. You don't know how, or maybe you do know how a car engine works, I don't really, kind of, a little bit, but I do know how to press the gas pedal and steer. >> Right. >> I don't need to know these things, and I think the Data Lake 2.0 is about, well, I don't need to know how Sentry, or Ranger, or Atlas, or any of these technologies work. I need to know that they're there, and when I access data, they're going to be applied to that data, and they're going to deliver me the stuff that I have access to and that I can see. >> So a couple things, it sounded like I was hearing abstraction, and you said really that's kind of the key, that sounds like a differentiator for AtScale, is giving customers that abstraction they need. But I'm also curious from a data value perspective, you talked about Redshift from an expense perspective. Do you also help customers gain abstraction by helping them evaluate the value of data and where they ought to keep it, and then you give them access to it? Or is that something that they need to do, kind of bring to the table?
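Baird's summary-table pattern above can be sketched in a few lines: scan the big, cheap copy of the data once, aggregate it down, and store the much smaller result where interactive queries are fast. This is a minimal illustration assuming PySpark with a Redshift JDBC driver available; the paths, columns, and connection settings are hypothetical, not AtScale's actual implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("build-summary").getOrCreate()

# Scan the raw fact data sitting in S3 in Parquet (the cheap tier).
sales = spark.read.parquet("s3a://datalake/sales/")

# Aggregate down to the grain the BI dashboards actually query.
daily = (sales.groupBy("store_id", "sale_date")
              .agg(F.sum("amount").alias("revenue"),
                   F.count("*").alias("txn_count")))

# Load the much smaller result into the fast interactive engine.
(daily.write.format("jdbc")
      .option("url", "jdbc:redshift://example-cluster:5439/analytics")
      .option("dbtable", "daily_store_sales")
      .option("user", "etl_user")
      .option("password", "REDACTED")
      .mode("overwrite")
      .save())
```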
>> We don't really care, necessarily, about the source of the data, as long as it can be expressed in a way that can be accessed by whatever engine it is. Lift and shift is an example. There's a big move from Teradata or from Netezza into a cloud-based offering. People want to lift it and shift it. It's the easiest way to do this. Same table definitions, but that's not optimized necessarily for the underlying data store. Take BigQuery for example, BigQuery's an amazing piece of technology. I think there's nothing like it out there in the market today, but if you really want BigQuery to be cost-effective, and perform and scale up to concurrency of... one of our customers is going to roll out about 8,000 users on this. You have to do things in BigQuery that are BigQuery-friendly. The data structures, the way that you store the data, repeated values, those sorts of things need to be taken into consideration when you build your schema out for consumption. With AtScale they don't need to think about that, they don't need to worry about it, we do it for them. They drop the schema in the same way that it exists on their current technology, and then behind the scenes, what we're doing is we're looking at signals, we're looking at queries, we're looking at all the different ways that people access the data naturally, and then we restructure those summary tables using algorithms and statistics, and I think people would broadly call it ML-type approaches, to build out something that answers those questions, and adapts over time to new questions, and new use cases. So it's really about, imagine you had the best data engineering team in the world, in a box; they're never tired, they never stop, and they're always interacting with what the customers really want, which is "Now I want to look at the data this way". >> It sounds actually like what you're talking about is you have a whole set of sources, and targets, and you understand how they operate, but when I say you, I mean your software. And so that you can take data from wherever it's coming in, and then you apply, if it's machine learning or whatever other capabilities, to learn from the access methods how to optimize that data for that engine. >> Matthew: Exactly. >> And then the end users have an optimal experience, and it's almost like the data migration service that Amazon has, it's like, you give us your Postgres or Oracle database, and we'll migrate it to the cloud. It sounds like you add a lot of intelligence to that process for decision support workloads. >> Yes. >> And figure out, so now you're going to... It's not Postgres to Postgres, but it might be Teradata to Redshift, or S3, that's going to be accessed by Athena or Redshift, and then let's put that in the right format. >> I think you sort of hit on something that we've noticed is very powerful, which is if you can set up, and we've done this with a number of customers, if you can set up the abstraction layer that is AtScale on your on-prem data, literally in, say, hours, you can move it into the cloud. Obviously you have to do the work of moving the data into the cloud, but once it's in the cloud you take the same AtScale instance, you re-point it at that new data source, and it works. We've done that with multiple customers, and it's fast and effective, and it lets you actually try out things that you may not have had the agility to do before, because there's differences in how the SQL dialects work, there's differences in, potentially, how the schema might be built.
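As a concrete reading of what "BigQuery-friendly" structures look like, here is a minimal sketch using the google-cloud-bigquery client, with nested, repeated fields standing in for a separate line-items table so no join is needed at query time. The project, dataset, and field names are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# One row per order; its line items nest inside the row as a repeated
# record, which suits BigQuery's columnar storage and avoids a join.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("store_id", "STRING"),
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("qty", "INTEGER"),
            bigquery.SchemaField("price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.retail.orders", schema=schema)
client.create_table(table)
```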
>> So a couple things I'm interested in. I'm hearing two A-words, that abstraction that we've talked about a number of times, and you also mentioned adaptability. So when you're talking with customers, what are some of the key business outcomes they need to drive where adaptability and abstraction are concerned, in terms of, like, cost reduction and revenue generation? What are some of those C-suite business objectives that AtScale can help companies achieve? >> So looking at, say, a customer, a large retailer on the East Coast, everybody knows the stores, they're everywhere, they sell hardware. They have a 20-terabyte cube that they use for day-to-day revenue analytics. So they do period over period analysis. When they're looking at stores, they're looking at things like, we just tried out a new marketing approach... I was talking to somebody there last week about how they have these special stores where they completely redo one area and just see how that works. They have to be able to look at those analytics, and they run those for a short amount of time. So if your window for getting data, refreshing data, building cubes, which in the old world could take a week, you know, my co-founder at Yahoo, he had a week and a half build time. That data is now two weeks old, maybe three weeks old. There might be bugs in it-- >> And the relevance might be, pshh... >> And the relevance goes down, or you can't react as fast. I've been at companies where... Speed is so important these days, and the new companies that are grasping data aggressively, putting it somewhere where they can make decisions on it on a day-to-day basis, they're winning. And they're spending... I was at a company that was spending three million dollars a month on pay-per-click, and if you can't get data every day, you're on the wrong campaigns, and everything goes off the rails, and you only learn about it a week later; that's 25% of your spend, right there, gone. >> So the biggest thing, sorry George, it really sounds to me like what AtScale can facilitate for probably customers in any industry is the ability to truly make data-driven business decisions that can really directly affect revenue and profit. >> Yes, and in an agile format. So, you can build-- >> That's the third A; agile, adaptability, abstraction. >> There ya go, the three A's. (Lisa laughs) We had the three V's, now we have the three A's. >> Yes. >> The fact that you're building a curated model, so in retail the calendars are complex. I'm sure everybody that uses Tableau is good at analyzing data, but they might not know what your rules are around your financial calendar, or around the hierarchies of your product. There's a lot of things that happen where you want an enterprise group of data modelers to build it, bless it, and roll it out, but then you're a user, and you say, wait, you forgot x, y, and z, I don't want to wait a week, I don't want to wait two weeks, three weeks, a month, maybe more. I want that data to be available in the model an hour later 'cause that's what I get with Tableau today. And that's where we've taken the two approaches of enterprise analytics and self-service, and tried to create a scenario where you get the best of both worlds. >> So, we know that an implication of what you're telling us is that insights are perishable, and latency is becoming more and more critical. How do you plan to work with streaming data where you've got a historical archive, but you've got fresh data coming in? But fresh could mean a variety of things.
Tell us what some of those scenarios look like. >> Absolutely, I think there are two approaches to this problem, and I'm seeing both used in practice, and I'm not exactly sure, although I have some theories, on which one's going to win. In one case, you are streaming everything into, sort of a... like I talked about, this data lake, S3, and you're putting it in a format like Parquet, and then people are accessing it. The other way is to access the data where it is. Maybe it's already in, this is a common BI scenario, you have a big data store, and then you have a dimensional data store; like, Oracle has your customers, Hadoop has machine data about those customers accessing on their mobile devices or something. If there was some way to access that data without having to move the Oracle stuff into the big data store, that's a federation story that I think we've talked about in the Bay Area for a long time, or around the world for a long time. I think we're getting closer to understanding how we can do that in practice, and have it be tenable. You don't move the big data around, you move the small data around. For data coming in from outside sources it's probably a little bit more difficult, but it is kind of a degenerate version of the same story. I would say that streaming is gaining a lot of momentum, and with what we do, we're always mapping, because of the governance piece that we've built into the product, we're always mapping where did the data come from, where did it land, and how did we use it to build summary tables. So if we build five summary tables, 'cause we're answering different types of questions, we still need to know that it goes back to this piece of data, which has these security constraints, and these audit requirements, and we always track it back to that, and we always apply those to our derived data. So when you're accessing these automatically ETLed summary tables, it just works the way it is. So I think that there are two ways that this is going to expand, and I'm excited about federation because I think the time has come. I'm also excited about streaming. I think they can serve two different use cases, and I don't actually know what the answer will be, because I've seen both in customers, some of the biggest customers we have. >> Well Matthew, thank you so much for stopping by. And the four A's: AtScale can facilitate abstraction, adaptability, and agility. >> Yes. Hashtag four A's. >> There we go. I don't even want credit for that. (laughs) >> Oh wow, I'm going to get five more followers, I know it! (George laughs) >> There ya go! >> We want to thank you for watching theCUBE. I am Lisa Martin, we are live in San Jose, at our event Big Data SV, I'm with George Gilbert. Stick around, we'll be back with our next guest after a short break. (techno music)
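For the first of Baird's two approaches, streaming everything into an S3 data lake in a format like Parquet, a minimal Spark Structured Streaming sketch looks like the following. The Kafka brokers, topic, and bucket paths are hypothetical assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-lake").getOrCreate()

# Continuously ingest events from a stream...
events = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "clickstream")
               .load())

# ...Kafka delivers raw bytes, so cast the payload before landing it.
payload = events.selectExpr("CAST(value AS STRING) AS json", "timestamp")

# Land it in the lake in Parquet so the same files can later feed
# summary tables, with lineage traced back to this landing path.
query = (payload.writeStream.format("parquet")
                .option("path", "s3a://datalake/clickstream/")
                .option("checkpointLocation", "s3a://datalake/_chk/clickstream/")
                .start())
query.awaitTermination()
```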
ENTITIES
Entity | Category | Confidence |
---|---|---|
Matthew | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Lisa Martin | PERSON | 0.99+ |
Matthew Baird | PERSON | 0.99+ |
George | PERSON | 0.99+ |
San Jose | LOCATION | 0.99+ |
Yahoo | ORGANIZATION | 0.99+ |
three weeks | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
25% | QUANTITY | 0.99+ |
Gardner | PERSON | 0.99+ |
two approaches | QUANTITY | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
two weeks | QUANTITY | 0.99+ |
Redshift | TITLE | 0.99+ |
S3 | TITLE | 0.99+ |
three million dollars | QUANTITY | 0.99+ |
two ways | QUANTITY | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
one case | QUANTITY | 0.99+ |
85% | QUANTITY | 0.99+ |
last week | DATE | 0.99+ |
a month | QUANTITY | 0.99+ |
Century | ORGANIZATION | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
a week | QUANTITY | 0.99+ |
BigQuery | TITLE | 0.99+ |
both | QUANTITY | 0.99+ |
20-terabyte | QUANTITY | 0.99+ |
a week and a half | QUANTITY | 0.99+ |
a week later | DATE | 0.99+ |
Data Lake 2.0 | COMMERCIAL_ITEM | 0.99+ |
two | QUANTITY | 0.99+ |
tomorrow morning | DATE | 0.99+ |
AtScale | ORGANIZATION | 0.99+ |
Atlas | ORGANIZATION | 0.99+ |
Bay Area | LOCATION | 0.98+ |
Lisa | PERSON | 0.98+ |
ParK | TITLE | 0.98+ |
Tableau | TITLE | 0.98+ |
five more followers | QUANTITY | 0.98+ |
an hour later | DATE | 0.98+ |
Ranger | ORGANIZATION | 0.98+ |
Netezza | ORGANIZATION | 0.98+ |
tonight | DATE | 0.97+ |
today | DATE | 0.97+ |
both worlds | QUANTITY | 0.97+ |
about 8,000 users | QUANTITY | 0.97+ |
theCUBE | ORGANIZATION | 0.97+ |
Strata Data Conference | EVENT | 0.97+ |
one | QUANTITY | 0.97+ |
Big Data SV 2018 | EVENT | 0.97+ |
Teradata | ORGANIZATION | 0.96+ |
AtScale | TITLE | 0.96+ |
Big Data SV | EVENT | 0.93+ |
East Coast | LOCATION | 0.93+ |
Hadoop | TITLE | 0.92+ |
two different use cases | QUANTITY | 0.92+ |
day one | QUANTITY | 0.91+ |
one area | QUANTITY | 0.91+ |
Wikibon Conversation with John Furrier and George Gilbert
(upbeat electronic music) >> Hello, everyone. Welcome to the Cube Studios in Palo Alto, California. I'm John Furrier, the co-host of the Cube and co-founder of SiliconANGLE Media Inc. I'm here with George Gilbert for a Wikibon conversation on the state of big data. George Gilbert is the analyst at Wikibon covering big data. George, great to see you. Looking good. (laughing) >> Good to see you, John. >> So George, you're obviously covering big data. Everyone knows you. You always ask the tough questions, you're always drilling down, going under the hood, and really inspecting all the trends, and also looking at the technology. What are you working on these days as the big data analyst? What's the hot thing that you're covering? >> OK, so, what's really interesting is we've got this emerging class of applications. The name that we've used so far is modern operational analytic applications. Operational in the sense that they help drive business operations, but analytical in the sense that the analytics either inform or drive transactions, or anticipate and inform interactions with people. That's the core of this class of apps. And then there are some sort of big challenges that customers are having in trying to build, and deploy, and operate these things. That's what I want to go through. >> George, you know, this is a great piece. I can't wait to (mumbling) some of these questions and ask you some pointed questions. But I would agree with you that to me, the number one thing I see customers either fumbling with or accelerating value with is how to operationalize some of the data in a way that they've never done before. So you start to see disciplines come together. You're starting to see people with a notion of digital business being something that's not a department, it's not a marketing department. Data is everywhere, it's horizontally scalable, and the smart executives are really looking at new operational tactics to do that. With that, let me kick off the first question to you. People are trying to balance the cloud, on-premises, and the edge, OK. And that's classic, you're seeing that now. I've got a data center, I have to go to the cloud, a hybrid cloud. And now the edge of the network. We were just talking about blockchain today, there's this huge problem. They've got to balance that, but they've got to balance it versus leveraging specialized services. How do you respond to that? What is your reaction? What is your presentation? >> OK, so let's turn it into something really concrete that everyone can relate to, and then I'll generalize it. The concrete version is, for a number of years, everyone associated Hadoop with big data. And Hadoop, you tried to stand up on a cluster on your own premises, for the most part. There was EMR on AWS, but sort of the big company activity, even including the big tech companies, was stand up a Hadoop cluster as a pilot and start building a data lake. Then see what you could do with sort of huge amounts of data that you couldn't normally collect and analyze. The operational challenges of standing up that sort of cluster were rather overwhelming, and I'll explain that later, so sort of park that thought. Because of that complexity, more and more customers, all but the most sophisticated, are saying we need a cloud strategy for that. But once you start taking Hadoop into the cloud, the components of this big data analytic system, you have tons more alternatives.
So whereas in Cloudera's version of Hadoop you had Impala as your MPP SQL database, on Amazon you've got Amazon Redshift, you've got Snowflake, you've got dozens of MPP SQL databases. And so the whole playing field shifts. And not only that, Amazon has instrumented, in that particular case, their application to be more of a managed service, so there's a whole lot less for admins to do. And you take that on, sort of, if you look at the slides, you take every step in that pipeline. And when you put it on a different cloud, it's got different competitors. And even if you take the same step in a pipeline, let's say Spark on HDFS to do your ETL, and your analysis, and your shaping of data, and even some of the machine learning, you put that on Azure and on Amazon, it's actually on a different storage foundation. So even if you're using the same component, it's different. There's a lot of complexity and a lot of trade-offs that you've got to make. >> Is that a problem for customers? >> Yes, because all of a sudden, they have to evaluate what those trade-offs are. They have to evaluate the trade-off between specialization. Do I use the best-of-breed thing on one platform? And if I do, it's not compatible with what I might be running on-prem. >> That'll slow a lot of things down. I can tell you right now, people want to have the same code base on all environments, and then just have the same seamless operational model. OK, that's a great point, George. Thanks for sharing that. The second point here is harmonizing and simplifying management across hybrid clouds. Again, back to your point. You set that up beautifully. Great example: open source innovation hits a roadblock. And the roadblock is incompatible components in multiple clouds. That's a problem. It's a management nightmare. How does harmonization across hybrid clouds work? >> You couldn't have asked it better. Let me put it up in terms of an X-Y chart where on the x-axis, you have the components of an analytic pipeline: ingest, process, analyze, predict, serve. But then on the y-axis, this is for an admin, not a developer. These are just some of the tasks they have to worry about: data governance, performance monitoring, scheduling and orchestration, availability and recovery, that whole list. Now, if you have a different product for each step in that pipeline, and each product has a different way of handling all those admin tasks, you're basically taking all the unique activities on the y-axis, multiplying them by all the unique products on the x-axis, and you have overwhelming complexity, even if these are managed services on the cloud. Here now you've got several trade-offs. Do I use the specialized products that you would call best of breed? Do I try and do end to end integration so I get simplification across the pipeline? Or do I use products that I had on-prem, like you were saying, so that I have seamless compatibility? Or do I use the cloud vendors'? That's a tough trade-off. There's another similar one for developers. Again, on the y-axis, all the things that a developer would have to deal with, not all of them, just a sample: the data model and the data itself, how to address it, the programming model, the persistence. So on that y-axis, you multiply all those different things you have to master for each product. And then on the x-axis, all the different products in the pipeline. And you have that same trade-off, again. >> Complexity is off the charts. >> Right.
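Gilbert's point that the same component sits on a different storage foundation per cloud can be made concrete with a small sketch: the Spark logic is unchanged, and only the storage URI moves. The paths here are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("portable-etl").getOrCreate()

# Swap exactly one URI to retarget the same pipeline at another cloud:
source = "s3a://lake/raw/orders/"  # Amazon S3
# source = "abfss://lake@myacct.dfs.core.windows.net/raw/orders/"  # Azure
# source = "hdfs:///lake/raw/orders/"  # on-prem Hadoop

# The transformation step itself is identical on every foundation.
(spark.read.parquet(source)
      .dropDuplicates(["order_id"])
      .write.mode("overwrite")
      .parquet("s3a://lake/clean/orders/"))
```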
And you can trade end to end integration to simplify the complexity, but we don't really have products that are fully fleshed out and mature that stretch from one end of the pipeline to the other, so that's a challenge. Alright. Let's talk about another way of looking at management. That was looking at the administrators and the developers. Now, we're getting better and better software for monitoring performance and operations, and trying to diagnose root cause when something goes wrong and then remediate it. There are two real approaches. One is you go really deep, but on a narrow part of your application and infrastructure landscape. And that narrow part might be, you know, your analytic pipeline, your big data. The broad approach is to get end to end visibility across the edge with your IoT devices, across on-prem, perhaps even across multiple clouds. That's the breadth approach, end to end visibility. Now, there's a trade-off here too, as in all technology choices. When you go deep, you have bounded visibility, but that bounded visibility allows you to understand exactly what is in that set of services, how they fit together, how they work. Because the vendor, knowing that they're only giving you management of your big data pipeline, can train their models, their machine learning models, so that whenever something goes wrong, they know exactly what caused it and they can filter out all the false positives, the scattered errors that can confuse administrators. Whereas if you want breadth, you want to see end to end your entire landscape so that you can do capacity planning and see that if there was an error way upstream, something might be triggered way downstream, or a bunch of things downstream. So the best way to understand this is how much knowledge do you have of how all the pieces work together, and how much knowledge do you have of how the software pieces fit together. >> This is actually an interesting point. So if I kind of connect the dots for you here: the bounded root cause analysis is where we see a lot of machine learning, that's where the automation is. >> George: Yeah. >> The unbounded, the breadth, that's where the data volume is. But they can work together, that's what you're saying. >> Yes. And actually, I hadn't even gotten to that, so thanks for teeing it up. >> John: Did I jump ahead on that one? (laughing) >> No, no, you teed it up. (laughing) Because ultimately-- >> Well, a lot of people want to know what's going to be automated away. All the undifferentiated labor at scale can be automated. >> Well, when you talk about them working together: so for the deep, depth-first approach, there's a small company called Unravel Data that sort of modeled eight million big data jobs, or workloads, from high tech companies, so they know how all that fits together and they can tell you when something goes wrong exactly what went wrong and how to remediate it. Then take something like Rocana or Splunk, they look end to end. The interesting thing that you brought up is at some point, that end to end product is going to be like a data warehouse, and the depth products are going to sit on top of it. So you'll have all the contextual data of your end to end landscape, but you'll have the deep knowledge of how things work and what goes wrong sitting on it. >> So just before we jump to the machine learning question which I want to ask you, what you're saying is the industry is evolving to almost looking like a data warehouse model, but in a completely different way. >> Yeah.
Think of it as, another cue. (laughing) >> John: That's what I do, George. I help you out with the cues. (laughing) No, but I mean the data warehouse, everyone knows what that was. A huge industry, created a lot of value, but then the world got rocked by unstructured data. And then their bounded view, if you will, got democratized. So creative destruction happened, which is another way of saying new entrants came in and incumbents got rattled. But now it's kind of going back to what looks like a data warehouse, but it's completely distributed around. >> Yes. And I was going to do one of my movie references, but-- >> No, don't do it. Save us the judge. >> If you look at this starting in the upper right, that's the data lake where you're collecting all the data, and it's for search, it's exploratory. As you get more structure, you get to the descriptive place where you can build dashboards to monitor what's going on. And when you get really deep, that's when you have the machine learning. >> Well, the machine learning is hitting the low-hanging fruit, and that's where I want to get to next to move it along. Sourcing machine learning capability, let's discuss that. >> OK, alright. Just to set context before we get there: notice that when you do end to end visibility, you're really seeing across a broad landscape. And when I'm showing my public cloud big data, that would be depth first just for that component. But if you do breadth first, you could do like a Rocana or a Splunk that then sees across everything. The point I wanted to make was, when you said we're reverting back to data warehouses and revisiting that dream again: the management applications started out as saying we know how to look inside machine data and tell you what's going on with your landscape. It turns out that machine data and business operations data, your application data, are really becoming one and the same. So what used to be a transaction, there was one transaction. And that, when you summarized them, went into the data warehouse. Then with systems of engagement, you had about 100 interaction events that you tracked, or sort of stored, for every business transaction. And then when we went out to the big data world, it's so resource intensive that we actually had 1,000 to 10,000 infrastructure events for every business transaction. So that's why the data volumes have grown so much and why we had to go back first to the data lake, and then curate it to the warehouse. >> Classic innovation story, great. Machine learning. Sourcing machine learning capabilities, 'cause that's where the rubber starts hitting the road. You're starting to see clear skies when it comes to where machine learning is starting to fit in. Sourcing machine learning capabilities. >> You know, even though we sort of didn't really rehearse this, you're helping cue me up perfectly. Let me make the assertion that with machine learning, we have the same shortage of really trained data scientists that we had of administrators when we were trying to stand up Hadoop clusters and do big data analytics. We did not have enough administrators because these were open source components built from essentially different projects, and putting them all together required a huge amount of skills. Data science requires, really, knowledge of algorithms that even really sophisticated programmers will tell you, "Jeez, now I need a PhD "to really understand how this stuff works."
So the shortage, that means we're not going to get a lot of hand-built machine learning applications for a while. >> John: There are a lot of libraries out there right now, you see TensorFlow from Google. Big traction with that application. >> George: But for PhDs, for PhDs. My contention is-- >> John: Well, developers too, you could argue developers, but I'm just putting it out there. >> George: I will get to that, actually. A slide just on that. Let me do this one first, because my contention is the first big application, widespread application of machine learning, is going to be the depth-first management, because it comes with a model built in of how all the big data workloads, services, and infrastructure fit together and work together. And if you look at how the machine learning model operates, when it knows something goes wrong, let's say an analytic job takes 17 hours and then just falls over and crashes, the model can actually look at the data layout and say we have way too much on one node, and it can change the settings and change the layout of the data, because it knows how all the stuff works. The point about this is the vendor, in this particular example, Unravel Data, built into their model an understanding of how to keep a big data workload running, as opposed to telling the customer, "You have to program it." So that fits into the question you were just asking, which is where do you get this talent. When you were talking about TensorFlow, and Caffe, and Torch, and MXNet, those are all like assembly language. Yes, those are the most powerful places you could go to program machine learning. But the number of people is inversely proportional to the power of those. >> John: Yeah, those are like really unique specialty people. High, you know, the top guys. >> George: Lab coats, rocket scientists. >> John: Well yeah, just high-end tier one coders, tier one brains coding away, AI gurus. This is not your working developer. >> George: But if you go up two levels. So one level up is Amazon machine learning, Spark machine learning. Go up another level, and I'm using Amazon as an example here: Amazon has a vision service called Rekognition. They have a speech generation service, natural language services. Those are developer ready. And when I say developer ready, I mean the developer just uses an API, you know, passes in the data, and the answer comes out. He doesn't have to know how the model works. >> John: It's kind of like what DevOps was for cloud at the end of the day. This slide is completely accurate in my opinion. And we're at the early days and you're starting to see the platforms develop. It's the classic abstraction layer. Whoever can abstract away the complexity as AI and machine learning grows is going to be the winning platform, no doubt about it. Amazon is showing some good moves there. >> George: And you know how they abstracted it away? In traditional programming, it was just building higher and higher level APIs, more accessible. In machine learning, you can't do that. You have to actually train the models, which means you need data. So if you look at the big cloud vendors right now, Google, Microsoft, Amazon, and IBM: most of them, the first three, have a lot of data from their B2C businesses. So you know, people talking to Echo, people talking to Google Assistant or Siri. That's where they get enough of their speech. >> John: So data equals power? >> George: Yes. >> By having data, you have the ingredients.
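The "developer ready" tier Gilbert describes can be made concrete with a short Amazon Rekognition call through boto3: the developer passes data to an API and gets labels back without ever touching the model. The region, bucket, and image name below are hypothetical.

```python
import boto3

# Managed vision service: no model to build, train, or tune.
rekognition = boto3.client("rekognition", region_name="us-east-1")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-photos", "Name": "storefront.jpg"}},
    MaxLabels=10,
    MinConfidence=80.0,
)

# The model's internals never surface; only labeled results do.
for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))
```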
And the more data that you have, the more data that you know about, the more data that has information around it, the more effective it can be to train machine learning algorithms. >> Yes. >> And the benefit comes back to the people who have the data. >> Yes. And so even though your capabilities get narrower, 'cause you could do anything on TensorFlow. >> John: Well, that's why Facebook is getting killed right now just to kind of change tangents. They have all this data and people are very unhappy, they just released that the Russians were targeting anti-semitic advertising, they enabled that. So it's hard to be a data platform and still provide user utility. This is what's going on. Whoever has the data has the power. It was a Frankenstein moment for Facebook. So there's that out there for everyone. How do companies do the right thing? >> And there's also the issue of customer intellectual property protection. As consumers, we're like you can take our voice, you can take all our speech to Siri or to Echo or whatever and get better at recognizing speech because we've given up control of that 'cause we want those services for free. >> Whoever can shift the data value to the users. >> George: To the developers. >> Or to the developers, or communities, better said, will win. >> OK. >> In my opinion, that's my opinion. >> For the most part, Amazon, Microsoft, and Google have similar data assets. For the most part, so far. IBM has something different which is they work closely with their industry customers and they build progressively. They're working with Mercedes, they're working with BMW. They'll work on the connected car, you know, the autonomous car, and they build out those models slowly. >> So George, this slide is really really interesting and I think this should be a roadmap for all customers to look at to try to peg where they are in the machine learning journey. But then the question comes in. They do the blocking and tackling, they have the foundational low level stuff done, they're building the models, they're understanding the mission, they have the right organizational mindset and personnel. Now, they want to orchestrate it and implement it into action. That's the final question. How do you orchestrate the distributed machine learning feedback and the data coherency? How do you get this thing scaling? How do these machines and the training happen so you have the breadth, and then you could bring the machine learning up the curve into the dashboard? >> OK. We've saved the best for last. It's not easy. When I show the chevrons, that's the analytic data pipeline. And imagine in the serve and predict at the very end, let's take an IOT app, a very sophisticated one. which would be an autonomous car. And it doesn't actually have to be an autonomous one, you could just be collected a lot of information off the car to do a better job insuring it, the insurance company. But the key then is you're collecting data on a fleet of cars, right? You're collecting data off each one, but you're also collecting then the fleet. And that, in the cloud, is where you keep improving your model of how the car works. You run simulations to figure out not just how to design better ones in the future, but how to tune and optimize the ones that are on the road now. That's number three. And then in four, you push that feedback back out to the cars on the road. 
And you have to manage, and this is tricky, you have to make sure that the models that you trained in step three are coherent, or the same, when you take out the fleet data and then you put the model for a particular instance of a car back out on the highway. >> George, this is a great example, and I think this slide really represents the modern analytical operational role in digital business. You can't look further than Tesla, this is essentially Tesla, and now all cars, as a great example, 'cause it's complex, it's an internet (mumbling) device, it's on the edge of the network, it's mobility, it's using 5G. It encapsulates everything that you are presenting, so I think this example is a great one, of the modern operational analytic applications that support digital business. Thanks for joining this Wikibon conversation. >> Thank you, John. >> George Gilbert, the analyst at Wikibon covering big data and the modern operational analytical systems supporting digital business. It's data driven. The people with the data can train the machines that have the power. That's the mandate, that's the action item. I'm John Furrier with George Gilbert. Thanks for watching. (upbeat electronic music)
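Gilbert's coherency requirement, that a per-vehicle model must match the fleet-model version it was tuned against before it goes back out on the road, can be sketched as a simple version check. The registry interface and all names here are hypothetical.

```python
import hashlib

def fingerprint(model_bytes: bytes) -> str:
    """Content hash used to identify a trained model artifact."""
    return hashlib.sha256(model_bytes).hexdigest()

def deploy_to_vehicle(vehicle_id: str, candidate: bytes,
                      fleet_version: str, registry: dict) -> bool:
    """Push a per-vehicle model only if it matches the fleet model
    currently served from the cloud (step three above)."""
    expected = registry.get(fleet_version)
    if expected != fingerprint(candidate):
        # Incoherent: tuned against a stale fleet model, so hold it back.
        return False
    # ...over-the-air push to the vehicle would happen here...
    return True

# The registry maps fleet model versions to artifact fingerprints.
model = b"...serialized model bytes..."
registry = {"fleet-v42": fingerprint(model)}
deploy_to_vehicle("car-1138", model, "fleet-v42", registry)
```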
ENTITIES
Entity | Category | Confidence |
---|---|---|
Amazon | ORGANIZATION | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Mercedes | ORGANIZATION | 0.99+ |
George Gilbert | PERSON | 0.99+ |
George | PERSON | 0.99+ |
John | PERSON | 0.99+ |
BMW | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
1,000 | QUANTITY | 0.99+ |
ORGANIZATION | 0.99+ | |
SiliconANGLE Media Inc. | ORGANIZATION | 0.99+ |
first | QUANTITY | 0.99+ |
second point | QUANTITY | 0.99+ |
17 hours | QUANTITY | 0.99+ |
Siri | TITLE | 0.99+ |
Wikibon | ORGANIZATION | 0.99+ |
Hadoop | TITLE | 0.99+ |
first question | QUANTITY | 0.99+ |
Palo Alto, California | LOCATION | 0.99+ |
eight million jobs | QUANTITY | 0.99+ |
Echo | COMMERCIAL_ITEM | 0.99+ |
two levels | QUANTITY | 0.99+ |
Tesla | ORGANIZATION | 0.99+ |
One | QUANTITY | 0.99+ |
each product | QUANTITY | 0.99+ |
each step | QUANTITY | 0.99+ |
first three | QUANTITY | 0.98+ |
Cube Studios | ORGANIZATION | 0.98+ |
one level | QUANTITY | 0.98+ |
one platform | QUANTITY | 0.98+ |
Rocana | ORGANIZATION | 0.98+ |
one transaction | QUANTITY | 0.97+ |
about 100 interaction | QUANTITY | 0.97+ |
dozens | QUANTITY | 0.96+ |
four | QUANTITY | 0.96+ |
Cube | ORGANIZATION | 0.96+ |
one end | QUANTITY | 0.96+ |
each one | QUANTITY | 0.96+ |
Google Assistant | TITLE | 0.96+ |
two real approaches | QUANTITY | 0.94+ |
Unravel Data | ORGANIZATION | 0.94+ |
one | QUANTITY | 0.93+ |
today | DATE | 0.92+ |
Mark Grover & Jennifer Wu | Spark Summit 2017
>> Announcer: Live from San Francisco, it's the Cube covering Spark Summit 2017, brought to you by Databricks. >> Hi, we're back here where the Cube is live, and I didn't even know it. Welcome, we're at Spark Summit 2017. Having so much fun talking to our guests, I didn't know the camera was on. We are doing a talk with Cloudera, a couple of experts that we have here. First is Mark Grover, who's a software engineer and an author. He wrote the book, "Hadoop Application Architectures." Mark, welcome to the show. >> Mark: Thank you very much. Glad to be here. >> And just to his left we also have Jennifer Wu, and Jennifer's director of product management at Cloudera. Did I get that right? >> That's right. I'm happy to be here, too. >> Alright, great to have you. Why don't we get started talking a little bit more about what Cloudera is maybe introducing new at the show? I saw a booth over here. Mark, do you want to get started? >> Mark: Yeah, there are two exciting things that we've launched at least recently. There's Cloudera Altus, which is for transient workloads, being able to do ETL-like workloads, and Jennifer will be happy to talk more about that. And then there's Cloudera Data Science Workbench, which is this tool that allows folks to do data science at scale. So, get away from doing data science in silos on your personal laptops, and do it in a secure environment on the cloud. >> Alright, well, let's jump into Data Science Workbench first. Tell me a little bit more about that; you mentioned it's for exploratory data science. So give us a little more detail on what it does. >> Yeah, absolutely. So, there was a private beta for Cloudera Data Science Workbench earlier in the year, and then it went GA a few months ago. And it's, like you said, an exploratory data science tool that brings data science to the masses within an enterprise. Previously, people used to have... it was this dichotomy, right? As a data scientist, I want to have the latest and greatest tools. I want to use the latest version of Python, the latest notebook kernel, and I want to be able to use R and Python to be able to crunch this data and run my models in machine learning. However, on the other side of this dichotomy is the IT organization, which wants to make sure that all tools are compliant, that your clusters are secure, and that your data is not going into places that are not secured by state-of-the-art security solutions, like Kerberos, for example, right? And of course, if the data scientists are putting the data on their laptops and taking the laptop around to wherever they go, that's not really a solution. So, that was one problem. And the other one was, if you were to bring them all together in the same solution, data scientists have different requirements. One may want to use Python 2.6. Another one may want to use 3.2, right? And so Cloudera Data Science Workbench is a new product that allows data scientists to visualize and do machine learning through this very nice notebook-like interface, share their work with the rest of their colleagues in the organization, but it also allows you to keep your clusters secure. So it allows you to run against a Kerberized cluster, allows single sign-on to your web interface to Data Science Workbench, and provides a really nice developer experience in the sense that my workflow and my tools and my version of Python do not conflict with Jennifer's version of Python.
We all have our own Docker and Kubernetes-based infrastructure that makes sure that we use the packages that we need, and they don't interfere with each other. >> We're going to go to Jennifer on Altus in just a few minutes, but George, first I'll give you a chance to maybe dig in on Data Science Workbench. >> Two questions on the data science side: some of the really toughest nuts to crack have been sort of a common environment for the collaborators, but also the ability to operationalize the models once you've sort of agreed on them, and manage the lifecycle across teams, you know? Like, challenger-champion, promote something, or even before that, doing the A/B testing. And then sort of, what's in production is typically in a different language from what, you know, it was designed in, and sort of integrating it with the apps. Where is that on the roadmap? 'Cause no one really has a good answer for that. >> Yeah, that's an excellent question. In general, I think it's the problem to crack these days: how do you productionalize something that was written by a data scientist in a notebook-like system onto the production cluster, right? And for the part where the data scientist works in a different language than the language that's in production, the best I can say right now is to actually have someone rewrite that. Have someone rewrite that in the language you're going to use in production, right? I don't see that to be the more common part, though. I think the more widespread problem is, even when the language is the production language, how do you go about making the part that the data scientist wrote, the model or whatever that would be, run on a production cluster? And so, Data Science Workbench in particular runs on the same cluster that is being managed by Cloudera Manager, right? So this is a tool that you install, but it is available to you as a web server, as a web interface, and so that allows you to move your development machine learning algorithms from your Data Science Workbench to production much more easily, because it's all running on the same hardware and the same systems. There's no separate Cloudera Manager that you have to use to manage the Workbench compared to your actual cluster. >> Okay. A tangential question, but one of the, the difficulties of doing machine learning is finding all the training data and, and sort of the data science expertise to sit with the domain expert to, you know, figure out the proper model and features, things like that. One of the things we've seen so far from the cloud vendors is they take their huge datasets in terms of voice, you know, images. They do the natural language understanding, speech-to-text or rather text-to-speech, you know, facial recognition, 'cause they have such huge datasets they can train on. We're hearing noises that they're going to take that down to the more mundane statistical kind of machine learning algorithms, so that you wouldn't be, like, here's an algorithm to do churn, you know, go to town, but that they might have something that's already kind of pre-populated that you would just customize. Is that something that you guys would tackle, too? >> I can't speak for the roadmap in that sense, but I think some of that problem needs to be tackled by projects like Spark, for example. So I think as the stack matures, it's going to raise the level of abstraction as time goes on. And I think whatever benefits the Spark ecosystem has will come directly to distributions like Cloudera. >> George: That's interesting.
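As a concrete sketch of the "same cluster, same systems" point (the data paths, column names, and model below are hypothetical, and this assumes Spark 2.x's standard ML pipeline API rather than anything Workbench-specific), promoting a notebook-built model can reduce to saving it to shared storage and loading it from a scheduled job:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-train").getOrCreate()

# Train in the notebook session against data on the shared cluster.
train = spark.read.parquet("/data/churn/train")  # assumed dataset path
assembler = VectorAssembler(inputCols=["tenure", "usage"], outputCol="features")
lr = LogisticRegression(labelCol="churned", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(train)

# Persist to the cluster's storage; because dev and prod share the same
# cluster, promotion is just pointing a scheduled job at this path.
model.write().overwrite().save("/models/churn/v1")

# The production job then simply loads and scores:
prod_model = PipelineModel.load("/models/churn/v1")
prod_model.transform(spark.read.parquet("/data/churn/today")).show()
```

The design point is that nothing is rewritten or copied between environments; development and production address the same storage and the same resource manager.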
>> Yeah. >> Okay. >> Alright, well, let's go to Jennifer now and talk about Altus a little bit. Now, you've been on the Cube show before, right? >> I have not. >> Okay, well, we're familiar with your work. Tell us again, you're the product manager for Altus. What does it do, and what was the motivation to build it? >> Yeah, we're really excited about Cloudera Altus. So, we released Cloudera Altus in its first GA form in April, and we launched Cloudera Altus in a public environment at Strata London about two weeks ago, so we're really excited about this, and we are very excited to now open this up to all of the customer base. And what it is is a platform-as-a-service offering designed to leverage, basically, the agility and the scale of cloud, and create a very easy-to-use type of experience to expose Cloudera capacity, in particular for data engineering types of workloads. So the end user will be able to very easily, in a very agile manner, get data engineering capacity on Cloudera in the cloud, and they'll be able to do things like ETL and large-scale data processing, productionize machine learning workflows in the cloud, with this new data-engineering-as-a-service experience. And we wanted to abstract away the cloud and cluster operations and make the end user experience really very easy. So, jobs and workloads are first-class objects. You can do things like submit jobs, clone jobs, terminate jobs, troubleshoot jobs. We wanted to make this very, very easy for the data engineering end user. >> It does sound like you've sort of abstracted away a lot of the infrastructure that you would associate with on-prem, and sort of almost made it, like, programmable and invisible. But, um, I guess one of my questions is, when you put it in a cloud environment... when you're on-prem, you have a certain set of competitors, which is kind of restrictive, because you are the standalone platform. But when you go on the cloud, someone might say, "I want to use Redshift on Amazon," or Snowflake, you know, as the MPP SQL database at the end of a pipeline. And it's not just... I'm using those as examples. There's, you know, dozens, hundreds, thousands of other services to choose from. >> Yes. >> What happens to the integrity of that platform if someone carves off one piece? >> Right. So, interoperability and a unified data pipeline are very important to us, so we want to make sure that we can still service the entire data pipeline, all the way from ingest and data processing to analytics. So our team has 24 different open source components that we deliver in the CDH distribution, and we have committers across the entire stack. We know the application, and we want to make sure that everything's interoperable, no matter how you deploy the cluster. So if you deploy data engineering clusters through Cloudera Altus, but you deployed Impala clusters for data marts in the cloud through Cloudera Director or through any other format, we want all these clusters to be interoperable, and we've taken great pains in order to make everything work together well. >> George: Okay. So how do Altus and Data Science Workbench interoperate with Spark? Maybe start with... >> You want to go first with Altus?
>> Sure. So, in terms of interoperability, we focus on things like making sure there are no data silos, so that the data in your entire data lake can be consumed by the different components in our system, the different compute engines and different tools. So if you're processing data, you can also look at this data and visualize this data through Data Science Workbench. So after you do data ingestion and data processing, you can use any of the other analytic tools, and this includes Data Science Workbench. >> Right, and Data Science Workbench runs, for example, with the latest version of Spark you could pick, currently the latest released version of Spark, Spark 2.1; Spark 2.2 is being onboarded, of course, and that will soon be integrated after its release. For example, you could use Data Science Workbench with your flavor of a Spark 2.x version, and you can run PySpark or Scala jobs on this notebook-like interface and be able to share your work. And because you're using Spark, underneath the hood it uses YARN for resource management; the Data Science Workbench itself uses Docker for configuration management and Kubernetes for resource-managing these Docker containers. >> What would be, if you had to describe sort of the edge conditions and the sweet spot of the application... I mean, you talked about data engineering. One thing we were talking to Matei Zaharia and Reynold Xin about, and Ali Ghodsi as well, was, if you put Spark on a database, or at least a, you know, sophisticated storage manager, like Kudu, all of a sudden there's a whole new class of jobs or applications that open up. Have you guys thought about what that might look like in the future, and what new applications you would tackle? >> I think a lot of that benefit, for example, could be coming from the underlying storage engine. So let's take Spark on Kudu, for example. The inherent characteristics of Kudu today allow you to do updates without having to either deal with the complexity of something like HBase, or the crappy performance of dealing with HDFS compactions, right? So the sweet spot comes from Kudu's capabilities. Of course, it doesn't support transactions or anything like that today, but imagine putting something like Spark on it and being able to use the machine learning libraries. We have been limited so far in the machine learning algorithms that we have implemented in Spark by the storage system sometimes, and, for example, new machine learning algorithms, or the existing ones, could be rewritten to make use of the update features in Kudu, for example. >> And so it sounds like the machine learning pipeline might get richer. But what I'm not hearing, and maybe this isn't sort of in the near-term roadmap, is the idea that you would build sort of operational apps that have these sophisticated analytics built in, you know, where the analytics... um, you've done the training, but at run time, you know, the inferencing influences a transaction, influences a decision. Is that something that you would foresee? >> I think that's totally possible. Again, at the core of it is the fact that now you have one storage system that can do scans really well, and it can also do random reads and writes any place, right?
And so that allows applications which were previously siloed (one application that ran off of HDFS, another application that ran off of HBase, and you had to correlate them) to just be one single application that can train and then also use the trained model to make decisions on the new transactions that come in. >> So that's very much within the sort of scope of imagination... or scope. That's part of sort of the ultimate plan? >> Mark: I think it's definitely conceivable now, yeah. >> Okay. >> We're up against a hard break coming up in just a minute, so you each get a 30-second answer here, and it's the same question. You've been here for a day and a half now. What's the most surprising thing you've learned that you think should be shared more broadly with the Spark community? Let's start with you. >> I think one of the great things that's happening in Spark today is... people have been complaining about latency for a long time. If you saw the keynote yesterday, you would see that Spark is making forays into reducing that latency. And if you are interested in Spark, using Spark, it's very exciting news. You should keep tabs on it. We hope to deliver lower latency as a community sooner. >> How long is one millisecond? (Mark laughs) >> Yeah, I'm largely focused on cloud infrastructure, and I found here at the conference that, like, many, many people are very much prepared to actually start taking on more, you know, more POCs and more interest in cloud, and the response to all of this, and to Altus, has been very encouraging. >> Great. Well, Jennifer, Mark, thank you so much for spending some time here on the Cube with us today. We're going to come by your booth and chat a little bit more later. It's some interesting stuff. And thank you all for watching the Cube today here at Spark Summit 2017, and thanks to Cloudera for bringing us these two experts. And thank you for watching. We'll see you again in just a few minutes with our next interview.
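A sketch of the Spark-on-Kudu pattern Mark describes (assuming the kudu-spark connector package is on the classpath; the format string and option names follow that connector's conventions, and the master address and table name are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-hybrid").getOrCreate()

events = (spark.read
          .format("org.apache.kudu.spark.kudu")       # kudu-spark connector
          .option("kudu.master", "kudu-master:7051")  # assumed master address
          .option("kudu.table", "events")             # assumed table name
          .load())

# Batch-style scan across the full history, for training or analytics...
events.groupBy("day").count().show()

# ...while the very same rows stay randomly readable (and updatable in
# place by the operational side) without HBase-style complexity or
# HDFS compaction rewrites.
events.where("user_id = 'u42'").show()
```

The single DataFrame is the point: one storage system serves both the scan-heavy analytical path and the random-access operational path within one application.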
Jim Campigli, WANdisco - #BigDataNYC 2015 - #theCUBE
>> Live from New York. It's The Cube, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now for your hosts, John Furrier and Dave Vellante. >> Hello, everyone. Welcome back. We're live in New York City for the Cube, a special big data [inaudible 00:00:27] our flagship program. We'll go out to the events. They expect a [inaudible 00:00:30] We are here live as part of Strata Hadoop Big Data NYC. I'm John Furrier. My co-host, Dave Vellante. Our next guest is Jim Campigli, the Chief Product Officer at WANdisco. Welcome back to The Cube. Great to see you. >> Thanks, great to be here. >> You've been COO of WANdisco, head of marketing, now Chief Product Officer for a few years. You guys have always had the patent. David was on earlier. I asked him specifically, why don't the other guys just do what you do? I wanted you to comment deeper on that, because he had a great answer. He said, patents. But you guys do something that's really hard that people can't do. >> Right. >> So let's get into it, because Fusion is a big announcement you guys made. Big deal with EMC, lot of traction with that, and it's one of these things that is kind of talked about, but not talked about. It's really a big deal, so what is the reason why you guys are so successful on the product side? >> Well, I think, first of all, it starts with the technology that we have patented, and it's this true active active replication capability that we have. Other software products claim to have active active replication, but when you drill down on what they're really doing, typically what's happening is they'll have a set of servers that they replicate across, and you can write a transaction at any server, but then that server is responsible for propagating it to all of the other servers in the implementation. There's no mechanism for pre-agreeing to that transaction before it's actually written, so there's no way to avoid conflicts up front, there's no way to effectively handle scenarios where some of the servers in the implementation go down while the replication is in process, and very frequently those solutions end up requiring administrators to do periodic resynchronization, go back and manually find out what didn't take, and deal with all the deltas, whereas we offer guaranteed consistency. And effectively what happens is, with us, you can write at any server as well, but the difference is we go through a peer-to-peer agreement process, and once a quorum of the servers in the implementation agree to the transaction, they all accept it, and we make sure everything is written in the same order on every server. And every server knows the last good transaction it processed, so if it goes down at some point in time, as soon as it comes back up, it can grab all the transactions it missed during that time slice while it was offline and resync itself automatically, without an administrator having to do anything. And you can use that feature not only for network and server outages that cause downtime, but even for planned maintenance, which is one of the biggest causes of Hadoop availability issues, because obviously if you've got a global deployment, when it's midnight on Sunday in the U.S., it's the start of the business day on Monday in Europe, and then it's the middle of the afternoon in Asia. So if you take Hadoop clusters down, somebody somewhere in the world is going to be going without their applications and data.
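A toy sketch of the quorum step Jim describes (purely illustrative; WANdisco's actual engine is its patented coordination protocol, and real agreement involves ordering rounds, not a single vote): a write is accepted everywhere, in one agreed order, only after a majority of servers votes for it, and a server that was offline replays what it missed on rejoin.

```python
class Server:
    def __init__(self, name):
        self.name, self.log, self.online = name, [], True

def propose(servers, txn):
    voters = [s for s in servers if s.online]
    if len(voters) <= len(servers) // 2:       # need a strict majority
        raise RuntimeError("no quorum: transaction rejected, no divergence")
    for s in voters:                           # same order on every server
        s.log.append(txn)

def resync(server, peers):
    # On rejoin, pull every transaction missed while offline.
    donor = max((p for p in peers if p.online), key=lambda p: len(p.log))
    server.log.extend(donor.log[len(server.log):])
    server.online = True

a, b, c = Server("a"), Server("b"), Server("c")
propose([a, b, c], "t1")
c.online = False
propose([a, b, c], "t2")   # 2-of-3 quorum still succeeds
resync(c, [a, b])          # c catches up automatically on return
assert a.log == b.log == c.log == ["t1", "t2"]
```

Because agreement happens before anything is written, there is no after-the-fact conflict repair: the rejoining server's catch-up is mechanical, which is the property behind the planned-maintenance story above.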
It's interesting; I want to get your comments on this, because it highlights the next conversation we've been hearing all throughout The Cube this week: analytics, outcomes. These are the kind of things that people talk about, because that means there's checks being written. Hadoop is moving into production. People have done the clusters. It used to be the conversation: hey, x number of clusters, you do this, you do that, replication here and there, YARN, all these different buzzwords. Really feeds and speeds. Now, Hadoop is relevant, but it's kind of invisible. It's under the hood. >> Right. >> Yet, it's part of other things in the network, so high availability and non-disruptive operations are table stakes now. So I want you to talk about that nuance, because that's what we're seeing as the engine powering Hadoop deployments. What is that? Take us through that nuance, because that's one of the things that you guys have been doing a lot of work in that's making it reliable and stable, so people can actually go out and play with Hadoop, deploy it, make sure it's always on. >> Well, we really come into play when companies are moving Hadoop out of the lab and into production. When they have defined application SLAs, when they can only have so much downtime, and it may be business requirements, it may be regulatory compliance issues, for example, financial services. They pretty much always have to have their data available. They have to have a solid back-up of the data. That's a hard requirement for them to put anything into production in their data centers. >> The other use case we've been hearing is, okay, I've got Hadoop, I've been playing with it, now I need to scale it up big time. I need to double, triple my clusters. I have to put it with my applications. Then the conversation's, okay, wait, do I need to do more sysadmin work? How do you address that particular piece? Because I think that's where Fusion comes in, from how I'm reading it, but is that a Fusion value proposition? Is it a WANdisco thing, and what does the customer get, and is that happening? >> Yeah, so there's actually two angles to that, and the first is, how do we maintain that uptime? How do we make sure there's performance availability to meet the SLAs, the production SLAs? The active active replication that we have patents for, that I described earlier, and that's embodied in DConE, our distributed coordination engine, is at the core of Fusion, and once a Fusion server's installed with each of your Hadoop clusters, that active active replication capability is extended to them, and we expose the HDFS API so the client applications, Sqoop, Flume, Impala, Hive, anything that would normally run against a Hadoop cluster, would talk through us. If it's been defined for replication, we do the active active replication of it. Otherwise it passes straight through and processes normally on the local cluster. So how does that address the issues you were talking about? What you're getting by default with our active active replication is effectively continuous hot back-up. That means if one cluster or an entire data center goes offline, that data exists elsewhere. Your users can fail over. They can continue accessing the data, running their applications. As soon as that cluster comes back online, it resyncs automatically. Now, what's the other... >> No user involvement? No admin? >> No user involvement in that.
Now, the only time, and this gets back into what I was talking about earlier, if I take servers offline for planned maintenance, upgrade the hardware, the operating system, whatever it may be, I can take advantage of that feature, as I was alluding to earlier. I can take the servers or the entire cluster offline, and Fusion knows the last good transactions that were processed on that cluster. As soon as the admin turns it back on, it'll resync itself automatically. So that's how you avoid downtime, even for planned maintenance, if you have to take an entire location off. Now, to your other question, how do you scale this stuff up? Think about what we do. We eliminate idle standby hardware, because everything is full read-write. You don't have standby read-only back-up clusters and servers when we come into the picture. So let's say we walk into an existing implementation, and they've got two clusters. One is the active cluster where everything's being written to, read from, actively being accessed by users. The other's just simply taking snapshots or periodic back-ups, or they're using distcp or something else, but they really can't get full utilization out of that. We come in with our active active replication capability, and they don't have to change anything, but what suddenly happens is, as soon as they define what they want replicated, we'll replicate it for them initially to the other clusters. They don't have to pre-sync it, and the cluster that was formerly for disaster recovery, for back-up, is now live and fully usable. So guess what? I'm now able to scale up to twice my original implementation by just leveraging that formerly read-only back-up cluster that I was... >> Is there a lot of configuration involved in that, or is it automatic? >> No, so basically what happens, again, is you don't have to synchronize the clusters in advance. The way we replicate is based on this concept of folders, and you can think of a folder as basically a collection of files and subdirectories that roll up into root directories, effectively, that reflect typically particular applications that people are using with Hadoop, or groups of users that have data sets that they access for their various sets of applications. And you define the replicated folders, basically a high-level directory that consists of everything in it, and as soon as you do that, we'll replicate it automatically. In a new implementation, let's keep it simple: let's say you just have two clusters, two locations. We'll replicate that folder in its entirety to the target you specify, and then from that point on, we're just moving the deltas over the wire. So you don't have to do anything in advance. And then suddenly that back-up hardware is fully usable, and you've doubled the size of your implementation. You've scaled up to 2x. >> So, I mean, what you were describing before really strikes me that the way you tell the complexity of a product and the value of a product in this space is what happens when something goes wrong. >> Yep. >> That's the question you always ask. How do you recover? Because recovery's a very hard thing, and your patents, you've got a lot of math inside there. >> Right. >> But you also said something that's interesting, which is you're an asset utilization play. >> Right. >> You're able to go in relatively simply and say, okay, you've got this asset that's underutilized. I'm now going to give you back some capacity that's on the floor and take advantage of that.
>> Right, and you're able to scale up without spending any more on hardware and infrastructure. >> So I'm interested in, so, another company: you're now with an EMC partnership this week. And they sort of got into this way back in the mainframe days with SRDF. I always thought when I first heard about WANdisco, it's like SRDF for Hadoop, but it's active active. Then they bought that, yada yada. >> And there's no distance limitations for their active active. >> So what's the nature of the relationship with EMC? >> Okay, so basically EMC, like the other storage vendors that want to play in the Hadoop space, exposes some form of an HDFS API, and in fact, if you look at Hortonworks or Cloudera, if you go and look at Cloudera Manager, one of the things it asks you when you're installing it is, are you going to run this on regular HDFS storage, effectively a bunch of commodity boxes typically, or are you going to use EMC Isilon or the various other options? And what we're able to do is replicate across Hadoop clusters running on Isilon, running on EMC ECS, running on standard HDFS, and what that allows these companies to do is, without modifying those storage systems, without migrating that data off of them, incorporate it into an enterprise-wide data lake, if that's what they want to do, and selectively replicate across all of those different storage systems. It could be a mix of different Hadoop distributions. You could have replication between CDH, HDP, Pivotal, MapR, all of those things, including the EMC storage that I just mentioned, it was mentioned in the press release, Isilon, and ECS, which effectively has Hadoop-compatible API support. And we can create in effect a single virtual cluster out of all of those different platforms. >> So is it a go-to-market relationship? Is it an OEM deal? >> Yeah, it was really born out of the fact that we have some mutual customers that want to do exactly what I just described. They have standard Hortonworks or Cloudera deployments in house. They've got data running on Isilon, and they want to deploy a data lake that includes what they've got stored on Isilon with what they've got in HDFS and Hadoop, and replicate across that. >> Like, the onerous EMC certification process? >> Yeah, we went through that process. We actually set up environments in our labs where we had EMC Isilon and ECS running and did demonstration integrations, replication across Isilon to HDP, to Hortonworks, Isilon to Cloudera, ECS to Isilon to HDP and Cloudera, and so forth. So we did prove it out. They saw that. In fact, they lent us boxes to actually do this in our labs, so they were very motivated, and they're seeing us in some of their bigger accounts. >> Talk about the aspect of two things: non-disruptive operations, meaning I want to be able to deploy stuff, because now that Hadoop has a hardened top with some abstraction layer, with analytics the focus, there's a lot of work going on under the hood, and a large-scale enterprise might have a zillion versions of Hadoop. They might have a little Hortonworks here. They might have something over here, so there might be some diversity in the distributions. That's one thing. The other one is operational disruption. >> Right. >> What do you guys do there? Is it zero disruption, and how do you deal with multiple versions of the distro?
Okay, so basically what we're doing, the simplest way to describe it, is we're providing a common API across all of these different distributions, running on different storage platforms and so forth, so that the client applications are always interacting with us. They're not worrying about the nuances of the particular Hadoop APIs that these different things expose. So we're providing a layer of abstraction, effectively. So we're transparent, in effect, in that sense, operationally, once we're installed. The other thing is, and I mentioned this earlier, when we come in, basically, you don't have to pre-sync clusters, you don't have to make sure they're all the same versions or the same distros or any of that. Just install us, select the data that you want to replicate, we'll replicate it over initially to the target clusters, and then from that point on, you just go. It just works. And we talked about the core patent for active active replication. We've got other patents that have been approved, three patents now and seven applications pending, that allow this active active replication to take place while servers are being added and removed from implementations, without disrupting user access or running applications and so forth. >> Final question for you: sum up the show this week. What's the vibe here? What's the aroma? Is it really Hadoop next? What is the overall Big Data NYC story here in Strata Hadoop? What's the main theme that you're seeing coming out of the show? >> I think the main theme that we're starting to see is twofold. I think one is we are seeing more and more companies moving this into production. There's a lot of interest in Spark and the whole fast data concept, and I don't think that Spark is necessarily orthogonal to Hadoop at all. I think the two have to coexist. If you think about Spark Streaming and the whole fast data concept, basically, Hadoop provides the historical data at rest. It provides the historical context. The streaming data provides the point-in-time information. What Spark together with Hadoop allows you to do is that real-time analysis, do the real-time informed decision-making, but do it within historical context instead of a single point-in-time vacuum. So I think what's happening, and you notice the vendors themselves aren't saying, oh, it's all Spark, forget Hadoop. They're really talking about coexisting. >> Alright, Jim from WANdisco, Chief Product Officer, really in the trenches, talking about what's under the hood and making it all scale in the infrastructure so the analysts can hit the scene. Great to see you again. Thanks for coming and sharing your insight here on The Cube. Live in New York City, we are here, day two of three days of wall-to-wall coverage of Big Data NYC in conjunction with Strata. We'll be right back with more live coverage in a moment here in New York City after this short break.
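A toy sketch of the replicated-folder flow Jim outlines earlier in the interview (purely illustrative; real Fusion replication is transactional and active active, while this only shows the "full copy first, deltas afterward" idea):

```python
import filecmp, os, shutil

def sync_folder(src, dst):
    """First call copies the folder wholesale; later calls ship deltas only."""
    shipped = []
    for root, _, files in os.walk(src):
        rel = os.path.relpath(root, src)
        os.makedirs(os.path.join(dst, rel), exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(dst, rel, name)
            # Only new or changed files cross the "wire".
            if not os.path.exists(d) or not filecmp.cmp(s, d, shallow=False):
                shutil.copy2(s, d)
                shipped.append(os.path.normpath(os.path.join(rel, name)))
    return shipped  # empty once the two sides are in sync

# e.g. sync_folder("/data/app1", "/dr-cluster/data/app1")
```

The administrator-facing idea is the same: designate a high-level folder once, let the initial transfer happen automatically, and from then on only changes move between clusters.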
Steve Wooledge - HP Discover Las Vegas 2014 - theCUBE - #HPDiscover
>>Live from Las Vegas, Nevada. It's theCUBE at HP Discover 2014, brought to you by HP. >>Welcome back, everyone. We're live here in Las Vegas for HP Discover 2014. This is theCUBE; we go where the action is. We're on the ground here at HP Discover, getting all the signals, sharing them with you, extracting the signal from the noise. I'm John Furrier, founder of SiliconANGLE. I'm joined by Steve Wooledge, VP of product marketing at MapR Technologies. Great to see you; welcome to theCUBE. >>Thank you. >>I know you've got a plane to catch, but I really wanted to squeeze you in, because you guys are a leader in the big data space. You guys are in the top three, the three big whales: MapR, Hortonworks, Cloudera. Um, you know, you're part of the original big data industry, which, you know, when we did the cube, when we first started in the industry, you had like 30, 34 employees total combined across the three: one company, Cloudera, and then MapR announced, and then Hortonworks. You guys have been part of that Holy Trinity of, of early pioneers. Give us the update; you guys are doing very, very well. Uh, we talked to you guys at the Hadoop summit last week; Jack Norris got the party going. Give us the update on what's going on with the momentum and the traction, and then I want to talk about some of the things with the product. >>Yeah. So we've seen a tremendous uptick in sales at MapR. We tripled revenue; we announced that publicly about a month ago. So we went up 300% in sales over Q3, I'm sorry, Q1 of 2013. And I think it's really, you know, the maturity of the market. As people move more towards production, they appreciate the enterprise features we built into the MapR Distribution for Hadoop. So, um, you know, the stats I would share is that 80% of our customers triple the size of their cluster within the first 12 months, and 50% of them double the size of the cluster, because, you know, they had that first production success use case, and they find other applications and start rolling out more and more. So it's been great for us.
So >>HP is a big data leader officer. They bought, uh, autonomy. They have HP Vertica. You guys are here. Hey, what are you doing here? Obviously we covered the cube, uh, the announcement with, uh, with, with HP Vertica, you here for that reason, is there other biz dev other activity going on other integration opportunities? >>Yeah, a few things. So, um, obviously the HP Vertica news was big. We went into general availability that solution the first week of may. So, um, what we have is the HP Vertica database integrated directly on top of our data platform. So it's this hybrid solution where you have full SQL database directly within your Hadoop distribution. Um, so it had a couple sessions on that. We had, uh, a nice panel discussion with our friends from Cloudera and Hortonworks. So really good discussion with HP about just the ecosystem and how it's evolving. The other things we're doing with HP now is, you know, we've got reference architectures on their hardware lines. So, um, you know, people can deploy Mapbox on the hardware of HP, but then also we're talking with the, um, the autonomy group about enterprise search and looking at a similar type of integration where you could have the search integrated directly into your Hadoop distro. And we've got some joint accounts we're piloting that she goes, now, >>You guys are integrating with HP pretty significantly that deals is working well. Absolutely. What's the coolest thing that you've seen with an HP that you can share. How so I asked you in the big data landscape, everyone's Bucher, you know, hunkering down, working on their feature, but outside in the real world, big data, it's not on the top of mind of the CIO, 24 7. It's probably an item that they're dressing. What have you seen and what have you been most impressed with at HP here? >>Yeah. Say, you know, this is my first HP event like this. I think the strategy they have is really good. I think in certain areas like the cloud in particular with the helium, I think they made a lot of early investments there and place some bets. And I think that's going to pay off well for them. And that marries pretty nicely with our strategy as well in terms of, you know, we have on-premise deployments, but we're also an OEM if you will, within Amazon web services. So we have a lot of agility in the cloud if you will. And I think as those products and the partnerships with HP, evolvable, we'll be playing a lot more with them in the cloud as well. >>I see that asks you a question. I want you to share with the folks out there in your own words, what is it about map bar that they may or may not understand or might not know about? Um, a little humble brag out there and share some, share some, uh, insight of, into, into map bar for folks that don't know you guys as a company and for the folks that may have a misperception of what you guys do shit share with them, with what, what map map is all about. >>Yeah. I mean, for me, I was in this space with Aster data and kind of the whole Hadoop and MapReduce area since 2008 and pretty familiar with everybody in the space. I really looked at Matt bars, the best technology hands down, you look at the Forrester wave and they rank us as having the best technology today, as well as product roadmap. I think the misperception is people think, oh, it's proprietary and close. It's actually the opposite of that. We have an unbiased open-source approach where we'll ship in support in our distribution, in the entire Apache spark stack. 
We're not selective over which projects within Apache spark. We support. Um, I feel like SQL on Hadoop. We support Impala as well as hive and other SQL on to do technologies, including the ability to integrate HP Vertica directly in the system. And it's because of the openness of our platform. I'd say it's actually more open because of the standards we've integrated into the data platform to support a lot of third-party tools directly within it. So there is no locked in the storage formats are all the same. The code that runs on top of the distribution from the projects is exactly the same. So you can build a project in hive or some other system, and you can port it between any of the distributions. So there isn't a, lock-in >>The end of the day, what the customers want is they want ease of integration. They want reliability. That's right. And so what are you guys working on next? What's the big, uh, product marketing roadmap that you can share with us? >>Yeah, I think for us, because of the innovations we did in the data platform allows us to support not only more applications, but more types of operational systems. So integrating things like fraud detection and recommendation engines directly with the analytical systems to really speed up that, um, accuracy and, and, uh, in targeting and detecting risk and things like that. So I think now over time, you know, Hadoop has sort of been this batch analytic type of platform, but the ability to converge operations and analytics in one system is really going to be enabled by technology like Matt BARR. >>How many employees do you guys have now? Uh, >>I'm not sure what our CFO would. Let me say that before. You can say we're over 200 at this point >>As well. And over five, the customers which got the data, you guys do summit graduations, we covered your relationship with HP during our big data SV. That was exciting. Good to see John Schroeder, big, very impressive team. I'm impressed with map. I will always have been. You guys have Stephanie kept your knitting saved. Are you going to do, and again, leading the big data space, um, and again, not proprietary is a very key word and that's really cool. So thanks for coming on. Like you really appreciate Steve. We'll be right back. This is the cube live in Las Vegas, extracting the city from the noise with map bar here at the HP discover 2014. We'll be right back here for the short break.
Jack Norris - Hadoop Summit 2014 - theCUBE - #HadoopSummit
>>theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks ("we do Hadoop") and headline sponsor WANdisco ("we make Hadoop invincible"). >>Okay. Welcome back, everyone. We're live here in Silicon Valley in San Jose. This is Hadoop Summit. This is SiliconANGLE and Wikibon's theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, joined by my cohost, Jeff Kelly, top big data analyst in the, in the community. Our next guest is Jack Norris, CMO of MapR. Security, enterprise: that's the buzz of the show, and it was the buzz of OpenStack Summit, another open source show. And here this year, you're just seeing move after move, um, talking about a couple of critical issues. Enterprise-grade Hadoop: Hortonworks announced a big acquisition, went all in, as they said, and now Cloudera follows suit with their news today. Are you sitting back saying they're catching up to you guys? I mean, how do you look at that? I mean, 'cause you guys have the security stuff nailed down. So how do you feel about that now? >>I think, um, if you look at the kind of Hadoop market, it's definitely moving from a test, experimental phase into a production phase. We've got tremendous customers across verticals that are doing some really interesting production use cases. And we recognized very early on that to really meet the needs of customers required some architectural innovation. So combining the open source ecosystem packages with some innovations underneath to really deliver high availability, data protection, disaster recovery features. Security is part of that, but if you can't, uh, protect the data, if you can't have multitenancy and separate workflows across the cluster, then it doesn't matter how secure it is. You know, you need those.
That types of use cases that you can do. What map ours done is provide some deep architectural innovations, provide complete read-write file systems to integrate data protection with snapshots and mirroring, et cetera. So there's a whole host of capabilities that make it easy to integrate enterprise secure and, and scale much better. Do you think, >>I feel like you were maybe a little early to the market in the sense that we heard Merv Adrian and his keynote this morning. Talk about, you know, it's about 10 years when you start to get these questions about security and governance and we're about nine years into Hadoop. Do you feel like maybe you guys were a little early and now you're at a tipping point, whereas these more, as more and more deployments get ready to go to production, this is going to be an area that's going to become increasingly important. >>I think, I think our timing has been spectacular because we, we kind of came out at a time when there was some customers that were really serious about Hadoop. We were able to work closely with them and prove our technology. And now as the market is just ramping, we're here with all of those features that they need. And what's a, what's an issue. Is that an incremental improvement to provide those kind of key features is not really possible if the underlying architecture isn't there and it's hard to provide, you know, online real-time capabilities in a underlying platform that's append only. So the, the HDFS layer written in Java, relying on the Linux file system is kind of the, the weak underbelly, if you will, of, of the ecosystem. There's a lot of, a lot of important developments happening yarn on top of it, a lot of really kind of exciting things. So we're actively participating in including Apache drill and on top of a complete read-write file system and integrated Hindu database. It just makes it all come to life. >>Yeah. I mean, those things on top are critical, but you know, it's, it's the underlying infrastructure that, you know, we asked, we keep on community about that. And what's the, what are the things that are really holding you back from Paducah and production and the, and the biggest challenge is they cited worth high availability, backup, and recovery and maintaining performance at scale. Those are the top three and that's kind of where Matt BARR has been focused, you know, since day one. >>So if you look at a major retailer, 2000 nodes and map bar 50 unique applications running on a single cluster on 10,000 jobs a day running on top of that, if you look at the Rubicon project, they recently went public a hundred million add actions, a hundred billion ad auctions a day. And on top of that platform, beats music that just got acquired for $3 billion. Basically it's the underlying map, our engine that allowed them to scale and personalize that music service. So there's a, there's a lot of proof points in terms of how quickly we scale the enterprise grade features that we provide and kind of the blending of deep predictive analytics in a batch environment with online capabilities. >>So I got to ask you about your go to market. I'll see Cloudera and Hortonworks have different business models. Just talk about that, but Cloudera got the massive funding. So you get this question all the time. What do you, how do you counter that army and the arms race? I think >>I just wrote an article in Forbes and he says cash is not a strategy. And I think that was, that was an excellent, excellent article. 
It goes into how, in this fast-growing market, an amount of money doesn't necessarily translate into architectural innovations or speed their development. This is a fairly fragmented ecosystem in terms of the stack that runs on top; there's no single application or single vendor that drives the value, so an acquisition strategy only goes so far. >>So is your field sales force direct or indirect, or a mix of both? How do you handle that? Because Cloudera has feet on the street, parking sales reps and SEs in all the enterprise accounts, and the squirrel is going to find a nut once in a while; they're going to engage those clients. So I guess it is a strategy if they're deploying sales and marketing, right? >>The beauty of this market is that we're all in this together in terms of sharing an API and driving an ecosystem. It's not a fragmented market: you can start with one distribution and move to another without recompiling or making any changes. So it's a fairly open community. If this were about vendor lock-in, then spending money on brand, et cetera, would be important. As for sales execution: yes, we have direct sales. We also have partners, and the percentage depends on the geography. >>And John Schroeder was on with HP at Big Data NYC. What's the update on the HP relationship? >>Oh, excellent. In fact, we just launched our application gallery, the App Gallery, which makes it very easy for administrators, developers, and analysts to get access to and understand what's available in the ecosystem. It's available directly on our website, and one of the featured applications there today is an integration between the MapR Sandbox and HP Vertica. So you can get early access, try it, and get the best of enterprise-grade SQL. >>The first Hadoop app store, basically, if you want to call it that. >>Sure. We launched with close to 30 applications, with a whole wave following that. >>So, speaking of Vertica, talk a little bit about SQL on Hadoop. There's a lot of talk, and some confusion, about the different methods for applying SQL on Hadoop. MapR takes an open approach; I know you'll even support things like Impala from a competitor, Cloudera. Talk about that approach from MapR's perspective. >>Our perspective is kind of unbiased open source. We don't try to pick and choose and dictate what the right open source is, based on either our participation or some community involvement. The reality is that with multiple applications being run on the platform, there are different use cases where different engines make sense. So whether it's a Hive solution, or Drill (Drill's available), or HP Vertica, people have the choice. It's part of a broad range of capabilities that you want to be able to run on the platform for your workflows, whether that's SQL access or MapReduce or a Spark framework, Shark, et cetera.
>>Yeah, there are so many options: there's Spark, you can run HP Vertica, you've got Impala, you've got Hive, and the Stinger initiative. That whole SQL-on-Hadoop ecosystem is still working itself out. Are we going to have this many options in a year or two, or are they complementary, each with its own role? >>I think the major difference is how they deal with the new data formats. Can they handle self-describing data sources? Can they leverage a JSON file? Do they require centralized metadata? Those are some of the perspectives and advantages that, say, Apache Drill has: expanding the data sets that are possible and enabling data exploration without a dependency on an IT administrator to define the metadata first.
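To illustrate that "no centralized metadata" point, here is a small sketch of querying a raw JSON file through Drill's JDBC driver, as Drill later shipped it. The file path, field names, and embedded query are illustrative assumptions, not details from the interview. Nothing registers the file in a metastore beforehand; Drill infers the structure from the data itself.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJsonQuery {
    public static void main(String[] args) throws Exception {
        // "zk=local" talks to a Drillbit embedded in the local process.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
            // Query the JSON file directly. Drill discovers the fields
            // (self-describing data), so no table definition or central
            // schema registration happens anywhere.
            ResultSet rs = stmt.executeQuery(
                "SELECT t.userId, t.action "
                + "FROM dfs.`/data/clickstream/events.json` t "
                + "LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("userId") + " " + rs.getString("action"));
            }
        }
    }
}
```

Contrast this with a Hive-style engine, where an administrator first declares a table and its columns in the metastore before anyone can query the file; that is the dependency the answer is pointing at.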
>>On a maybe less exciting topic: taking workloads from existing systems and moving them to Hadoop is one of the ways a lot of people get started, whether it's offloading transformation workloads or something in that vein. You've announced a partnership with Syncsort, and one of the things they focus on is making those migrations as easy as possible. Talk a little bit about that partnership and why it makes sense for you. >>I think it's a great proof point, because we announced that partnership around mainframe offload; we have comScore, for example, cited in that press release. A workload moving from a mainframe to Hadoop sounds almost like an oxymoron, but with the capabilities MapR has, making the platform a system of record with full high availability and data protection, we're actually an option for offloading from the mainframe, offloading from SAN processing, and providing a really cost-effective, scalable alternative. And we've got customers that had tried to offload from the mainframe multiple times in the past, unsuccessfully, and have done it successfully with MapR. >>Talk a little bit more about the broader partnership strategy. We're here at Hadoop Summit, and Hortonworks talks a lot about their partnerships and their reseller arrangements. You seem to take a bit more of a direct approach. What's MapR's approach to partnering as it relates to resell arrangements and the like? >>I think the App Gallery is a great proof point there. The strategy is an ecosystem approach: a collection of tools and applications and management facilities, as well as applications on top. It's a very open strategy. We focus on making sure that we have open APIs at the application layer and that it's very easy to get data in and out. And that's part of the architecture: by presenting a standard file system interface, by allowing non-Java applications to run directly on our platform, by supporting standard database connections, ODBC and JDBC, and by providing database functionality in addition to deep predictive analytics, it's about supporting the broadest set of applications on a single platform. What we're seeing in this modern architecture is that data gravity matters: the more processing you can do on a single platform, the better off you are, the more agile, the more competitive.
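To ground the "standard file system interface" claim: MapR exposes the cluster over NFS, so an ordinary program, with no Hadoop client libraries at all, can read and write cluster data using plain file I/O. A minimal sketch; the mount point follows MapR's usual /mapr/<cluster-name> convention, but the cluster name and paths here are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NfsMountWrite {
    public static void main(String[] args) throws IOException {
        // The cluster appears as a normal directory tree via the NFS
        // mount, so plain java.nio calls work: no HDFS API, no Hadoop jars.
        Path report = Paths.get("/mapr/my.cluster.com/analytics/daily-report.csv");
        Files.createDirectories(report.getParent());
        Files.write(report, "date,clicks\n2014-06-03,1024\n"
                .getBytes(StandardCharsets.UTF_8));

        // Appending from a non-Hadoop program is ordinary file semantics
        // here; any legacy tool that writes files can feed the cluster
        // the same way.
        Files.write(report, "2014-06-04,2048\n".getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.APPEND);
        System.out.println(Files.readAllLines(report));
    }
}
```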
>>So in terms of partnering with people like SAS, for example, to bring some of the analytic capabilities onto the platform, can you tell us a little bit about that? >>Companies like SAS and Revolution Analytics and Skytree, and a whole host of other companies on the analytics side, as well as on the tools and visualization side, et cetera. >>Well, I bring up SAS because I think they get the whole data gravity situation: they've got to go to where the data is, rather than have the data come to them. So I give them credit for acknowledging that big data truism. >>It's all going to the data, not bringing the data to the compute. >>Jack, talk about the success you've had with customers; you cited some pretty impressive numbers, about 500 customers. Merv Adrian of Gartner was on with us earlier, essentially saying, without mentioning MapR by name, that what you're doing is right where the puck is going, while some other vendors aren't even at the same rink. So I've got to give you props on that. Talk about where you're winning and where you're successful, and also what you've struggled with and need to improve on. >>Yeah, there's a whole class of applications that Hadoop is enabling, which is about operations and analytics together: taking high-arrival-rate, machine-generated data, doing analytics as it happens, and then impacting the business. Whether it's fraud detection or recommendation engines or supply chain applications using sensor data, it's happening very, very quickly. So a system that can tolerate and accept streaming data sources, that has real-time operations, that is 24 by 7 and highly available, is what really moves the needle. Those are the examples I used: ad tech like the Rubicon Project, and cable TV. >>What's the primary outcome your clients want with your product? Is it stability? Is it the development the platform enables? Is there an outcome that's consistent across all your wins? >>Big picture, some of them are focused on revenue: how do we optimize revenue, whether through a new data source, a new application, or an existing application where we're expanding the data set. Some are focused on reducing costs: they want to do things like a mainframe offload or a data warehouse offload. And some are focused on risk mitigation. If there's anything they have in common, it's that as they move from test toward production, they want the key capabilities they have in their enterprise systems today to be in Hadoop as well. It's nothing new; it's "hey, we've got SLAs, I've got data protection policies, I've got a disaster recovery procedure, so why can't I expect the same level of capabilities in Hadoop that I have today in those other systems?" >>Final question: where are you heading this year? What are your key objectives? Obviously you have this flurry of announcements and good success. How many employees are you at? Give us a quick update on the numbers. >>We just reported incredible momentum, with triple-digit growth year over year, and we've added a tremendous number of customers; we're over 500 now. So we're sticking to our knitting, focusing on the customers, and elevating the proof points. Some of our most significant customers, in telco and financial services and healthcare and retail, view this as a strategic weapon, a huge competitive advantage, and it's helping them impact their business. That's really spurring our success. We're growing at an incredible clip, and it's a great time to have made those calls and those investments early on and to be reaping the benefits. >>I've always said, since the first Hadoop Summit, when Hortonworks came out of Yahoo and this whole community kind of burst open: you had Hadoop World, now O'Reilly runs it, and it has a whole different vibe; look at the developer vibe here. So I've got to ask you, and I've always been a big fan: everyone has enough beachhead to be successful. It's not MapR versus Hortonworks versus Cloudera; that's why I always smile when everyone says "Cloudera or Hortonworks." They're different animals at this point, doing different things. Everyone has their swim lanes, their beachheads; there isn't a lot of head-to-head competition. Do you think it's going to stay this way for a while? What's your forecast? At what point do you see more competition, 10 years out? Merv was talking about a 10-year horizon for innovation. >>I think the more people learn and understand about Hadoop, the more they'll appreciate the set of capabilities that matter in production and post-production, and that appreciation will migrate earlier into the cycle. And as we focus on more developer tools, like our Sandbox, so people can easily get experience and understand what MapR is, I think we'll start to see a lot more understanding and momentum. >>Awesome. Jack Norris here inside theCUBE, CMO of MapR, a very successful enterprise-grade Hadoop player and a leader in the space. Thanks for coming on; we really appreciate it. You're watching theCUBE, live in Silicon Valley at Hadoop Summit 2014. We'll be right back after this short break.
Amr Awadallah - Hadoop Summit 2013 - theCUBE - #HadoopSummit
>>Welcome back. This is SiliconANGLE's coverage of Hadoop Summit in Silicon Valley. I'm John Furrier, the founder of SiliconANGLE. We're pleased to have a friend inside theCUBE; it's rare to have such luminaries: Amr Awadallah, good friend and also co-founder of Cloudera, really the pioneer in the space that helped build the industry we're living in here at Hadoop Summit. I'm with Dave Vellante of Wikibon.org. Amr, welcome back to theCUBE, a CUBE alumni. >>Thank you for having me here. >>Wow, what a journey. You co-founded Cloudera; I remember when you were in stealth mode and really couldn't talk about it, and then of course the history of SiliconANGLE being founded and kind of built in your office, when you only had 20-something employees. We owe a great deal of gratitude to you, and congratulations to you, Michael Olson, and the team for building an industry. So I just wanted to say thank you, and welcome to theCUBE. >>Thank you. It's great to be here. >>So what's your take on the current Hadoop ecosystem? Obviously a lot has happened. It's big now; it's growing up fast. The word enterprise-grade is out there. You're seeing it move from trying to change the world (in our first interview you said, "I've seen the future, I want to bring it to the mainstream") to hitting mainstream right now. What's your take on the current state of the ecosystem and its value? >>Yeah. I have a quick question first: should I look at you or look at the camera? >>The camera, or both; whatever you'd like. >>So I think the ecosystem is definitely growing, which is very, very healthy. But there's a side question there, which is: what do you think of all the competition coming into the space? Five years ago, when Cloudera was started, it was just Cloudera. There was no other commercial vendor trying to support or enable Hadoop in the industry for enterprises. And today there are at least ten of them trying to compete with us. That includes big, established companies that decided, hey, we're going to start addressing this space, but it also includes many, many newcomers, like Hortonworks, founded over the last couple of years. That's a healthy thing; it's absolutely a sign of a growing market. If the market wasn't growing, if there wasn't money in the market, if it was just hype, there wouldn't be all of these new companies and new ventures showing up. That said, I never look at competition as something that worries me, something where I'm afraid of what's going to happen to me. That's normal; that's exactly what happens to successful companies. If you look at Red Hat when Red Hat was launching with Linux, they had 25 competitors, or even more, 30 competitors, when Red Hat was forming. And today, of those 25, 30 competitors, six or seven are still left. So I think it's a very, very healthy sign of the growth of this market and the maturity it's reaching. >>What do you think about some of the white spaces that are evolving? You guys have obviously been involved in a lot of deployments at Cloudera; you're doing a lot of work with the top names, though the clients you have usually aren't disclosed because you really can't disclose them. What are you seeing right now as the white spaces, the things still to do on the Hadoop platform? >>It's a very, very good question.
So first, I can't talk about future roadmap. We're becoming a big company now, at that level where we can't comment on future roadmaps. >>Ah, that's a sign of the times. You're well media-trained; good to see they're doing a good job keeping you on message. >>You want more information on that? I can connect you with PR. >>No, no, we're good. We'll get it out of you. >>But our vision for Cloudera from day one, like you were saying earlier, we saw the future: our vision from day one was really to build this data system where we can have data of any type, whether that data is structured or unstructured or images, it doesn't matter, and then, on top of that data, run any type of workload. That workload could be the initial genesis of Hadoop, which is MapReduce, batch processing. But now, as we've made many announcements over the last few years, we also have Impala for interactive analytics as a workload. We have a very, very strong partnership with SAS for doing machine learning and statistics as a workload. And a few weeks ago we announced search as another workload. So you have multiple types of workloads that can handle the different types of problems you have within your organization, and you bring all of these workloads to all of your data, regardless of type. That's the vision we'll continue to deliver on; that's exactly what we're building going into the future. >>So how does that fit in with YARN? We're hearing a lot at this conference about YARN, the ability to do more with less, and a lot of the things you typically hear within the enterprise. Talk about that a little bit. >>YARN is a very core part of our platform. In fact, YARN has been part of CDH4 for more than a year now, out in the market; I think we were the first vendor who brought YARN into a distribution of Hadoop out there. It's very, very fundamental to us, because that is how we're going to coordinate launching all of these different types of workloads. You're going to have the MapReduce workload, which is very batch-oriented; the Impala workload, which is very latency-sensitive; the search workload, which is also very latency-sensitive; the machine learning workload, which is more batch-oriented; et cetera, et cetera. YARN is a very, very central piece to helping us coordinate all of these different types of workloads on the platform.
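For a sense of what that coordination looks like mechanically, here is a minimal sketch against the YARN client API. The queue name and the placeholder container command are assumptions for illustration, not anything Awadallah describes: each engine's work is submitted as a YARN application, and the scheduler arbitrates cluster resources among queues so batch jobs and latency-sensitive engines can coexist.

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

import java.util.Collections;

public class SubmitWorkload {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarn.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("nightly-batch-etl");
        // Hypothetical queue split: "batch" for MapReduce-style jobs,
        // leaving a separate queue free for latency-sensitive engines.
        ctx.setQueue("batch");
        ctx.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));
        // A real ApplicationMaster command would go here; a placeholder
        // shell command keeps the sketch self-contained.
        ctx.setAMContainerSpec(ContainerLaunchContext.newInstance(
                Collections.emptyMap(), Collections.emptyMap(),
                Collections.singletonList("sleep 30"),
                Collections.emptyMap(), null, Collections.emptyMap()));
        yarn.submitApplication(ctx);

        yarn.stop();
    }
}
```

A real application would point the ApplicationMaster command at an actual AM binary, as MapReduce does with its MRAppMaster; the placeholder above only illustrates the submission path.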
>>Cloudera has been a great citizen in the community too. As we mentioned, we witnessed your team help create the industry: you were there, you took the chance, you were the first ones commercially funded by the venture capitalists, then others followed, and now you see this huge ecosystem, a lot of noise, a lot of people trying to get attention. So I've got to ask you, because I want you to address this, because I know it's been talked about on some of the other blogs: there's a lot of FUD going on around who's doing what, and in some cases maybe flat-out misinformation. That happens in a growing market; the elbows get sharp. Just to clarify, because I've gotten back-channel information around, you know, who the committers really are, and it's been well documented that there's a lot of FUD out there: what would you say to the folks out there to clarify that? >>Yes. I would say that our focus should be to continue to work as a community to push the platform forward. At Cloudera we do a lot of contributions. Hortonworks is definitely one of the top contributors out there as well, I'll acknowledge that, as are many, many other companies, and we want to continue to see the platform evolve. I will stress, though, that at Cloudera we have a number of the original project founders working at the company. So it's not just the contributions we make, but the fact that we have the founders of these projects working at Cloudera. And some of these projects were actually created at Cloudera from day one, as opposed to created at some other company, after which you hire the employee and they work for you. So I'll give you a couple of examples from Cloudera. Doug Cutting:
So first it's important to stress that our core platform, CDH, is open source. Everything we put in the core platform is open source. So for example, in Palo, which we launched very recently as a ga, now we launched beta last year, but now's ga is a hundred percent Apache license, a hundred percent open source search, which we announced very recently is also open source. So the platform itself, we're committing to everything in there to be open source. Now we believe fundamentally just from having lots of history in studying the open source markets from our ceo Mike Olson himself being one of the very first open source people in the world with, with sleepy cats, the company that he sold to Oracle before founding Cloudera from our investors, helping many other open source companies. To have a successful open co open source company, you need to have a very good engine between the business model that generates revenue and between the product that you are creating. If you don't have a good feedback loop there between these two, you won't be able to sustain the innovation to continue to push the, the boundaries of how good the product is. So we strongly believe in that if you are, if your product is literally a hundred percent open source, meaning both the management and every, there is nothing proprietary whatsoever inside of your products. I can't tell what that is. It's >>Taking a picture. >>Oh, sorry, I thought somebody was waiting >>For me. >>Sorry about that. >>It's a cheap signal. >>It >>Was like a's really good. >>I thought it's like a card of paper with some writing. You, >>You, you have a fan fans out there. They're storming the, the concert here. >>Okay, that's, that's good to hear. That's good to hear. Sorry about that interruption. So if, if, if you have everything a hundred percent open source, that creates two problems. First you have no differentiation whatsoever, meaning another big corporation without naming who the big corporations could be, we just can take everything you do, literally every single bit of source code you have and say, Hey, we can do it too. Come to us, don't work with those guys. Right? We have the latest, greatest things that they have. Why do you wanna continue to work with them? So no, no differentiation is number one, which is very dangerous. And number two, when it becomes, if, if it's a hundred percent open source and there is lots of other vendors able to take the art, the open source artifact and work with it, then it becomes now purely about maintenance and insurance on the products, which is a commodity product, which obviously the prices for that will go down to the ground and you won't be able to have this sustain this positive feedback effect between your business model and between your product code map and won't be able to build a long-lasting company. >>So that's why we do have a combination of open source artifacts and proprietary artifacts. Now our pro proprietary AR artifacts is always around the management of the system, right? So how do we manage the security of the system? How do we manage the, the data flow within the system? How do we manage the services inside the, of the system across all layers, right? Not just the Hado player but the edge based layer, the zookeeper layer, et cetera, et cetera. So that's where we focus our efforts going forward and that's how we differentiate ourself from our, from other vendors out there. Cloudera manager, Cloudera navigator are very unique to us. 
Nobody else has anything close to those capabilities out there. >>So it sounds like the contributions you make to open source are cultural in nature, DNA of sorts: something you guys do because you've always done it. >>Absolutely. >>And the artifacts that are proprietary are essentially around rationalizing the revenue opportunity against the expense you're going to apply there; that's how you decided how to balance it. >>That's one. And then, two, the differentiation from other competitors. So those two things, yes. I believe that's fundamental to open source business models. >>Yeah, there are many open source business models, right? You can go pure service; you can, like you said, totally bogart the code... >>There is no pure-service open source company that was able to build a long-lasting, surviving public company. It never happened in history. They always get acquired, because it becomes a commodity. >>Right. I mean, even IBM. >>Amr, I want to ask you about the storage thing we were talking about before the cameras: the Hortonworks storage announcement. What's your take on that? >>Which one, the Gluster one, with Red Hat? Yes. So there has been recent news about Red Hat with Hortonworks: a version of the Hadoop platform that uses MapReduce for the computation but uses Red Hat for the storage. Red Hat has a new storage offering built from a company they acquired called Gluster. And that news was very, very surprising to me. The reason it was surprising is that it correlates with a shift in messaging from Hortonworks. If you look at Hortonworks at Hadoop Summit last year, one of the key messages they delivered was that by 2015 (the tagline back then was "by 2015," and you're doing the research right now to see if I'm saying the right thing) half the world's data would be stored in Hadoop. >>If you look today at the slides, it doesn't say that; it says "within five years." >>Right, the second iteration was "within five years." And now they say something different: now they say that by 2015, half the world's data will be processed by Hadoop, instead of stored by Hadoop. And that's a very, very fundamental... >>It's a nuance. >>It's a very important nuance. >>Well, it's a big deal, because when I first saw that I said, hmm, what does this all mean? And 2015 sounded a little early. And now you're saying "processed by"; okay, that's different. >>Yes, exactly. And here's why this matters: we believe HDFS is very, very core to the Hadoop platform. HDFS, the storage system of Hadoop, is really the layer that matters more than anything else: how scalable, how reliable, and how economical that storage layer is. So we really ask Hortonworks, and all the companies working in the Hadoop community, not to fragment at the storage layer. We need the storage for Hadoop to stay inside of Hadoop and not to fragment that out. That's very, very critical. >>Okay.
>>So you're saying that they're not coming out and saying "we're going to fragment HDFS," but the way this is positioned might signal that? >>No, no: the announcement with Red Hat is the direct signal. Literally, you'll be able to run MapReduce directly on top of Red Hat Storage instead of HDFS. >>I interpreted it as Hortonworks just hedging on its prediction, and I said okay, I'll give them a break on that. You're saying it's something different. >>It's a shift in strategy, potentially. Which can be dangerous. >>Is that a compliance issue? Because, you know, there's the discussion around Hadoop and POSIX, and Red Hat does have a lot of enterprise customers. So is it maybe that? >>Then invest in making Hadoop POSIX-compliant, which, actually, by the way, we are investing in as a community. We are investing in adding POSIX compliance to Hadoop, and we're investing in adding snapshots to Hadoop, which will be coming very, very soon.
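HDFS snapshots did land in Hadoop 2.x not long after this conversation; here is a minimal sketch of the API as it shipped, with the directory path and snapshot name as illustrative assumptions. A directory is first marked snapshottable (an admin operation), after which cheap, read-only, point-in-time images of it can be taken.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        Path warehouse = new Path("/data/warehouse");
        // Admin step: mark the directory as snapshottable
        // (equivalent to: hdfs dfsadmin -allowSnapshot /data/warehouse).
        dfs.allowSnapshot(warehouse);

        // Read-only, point-in-time image of the directory tree; it
        // appears under /data/warehouse/.snapshot/pre-etl.
        Path snap = dfs.createSnapshot(warehouse, "pre-etl");
        System.out.println("Created snapshot at " + snap);
    }
}
```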
>>Well, do you think that, pick a year, I don't care if it's 2015, 2020, whenever, the majority of the world's data will be running in Hadoop? >>The majority of the world's data that has to do with analytics, yes. >>Okay, so there's a caveat. >>It's a very important caveat, exactly, because there are lots of types of data that are not very suitable for Hadoop at all. For example, the data storage for Oracle database systems: no, you want to store that on NetApp or EMC; you don't want to store that in Hadoop. The data storage for streaming video files, just streaming lots and lots of video files: no, you don't want to store that in Hadoop either. >>Which is a huge proportion of the data. Video files, in fact, could overwhelm the rest. >>Yeah. So the nuance I would add: I agree with the "half" claim, but it's half within the world of data that exists for the purpose of analysis. >>Okay, that narrows it down; it's more reasonable. >>It's still a huge market, by the way. >>It is. Okay, so what's next for you? You've gone on this journey, you started this company, you've been traveling around like crazy working with customers. What's the next phase of Amr Awadallah's career? What do you want to have happen next? What excites you? What are you working on? >>It's just to continue to grow Cloudera to be the biggest company it can be. We want to be, literally, one of the very few companies that are able to take an open source model and turn it into a large, publicly traded corporation. >>You've talked about that: you brought a new CEO on, and if you look at the background of the CEO, clearly he's got some IPO chops. So that's an aspiration you've put forth. And you're outward-facing now, doing a lot of travel. Where have your travels taken you? You've been in China, and you've obviously got a European office open now. What's going on internationally? Give us some sound bites of what's happening in the field. >>So internationally, Europe is definitely our next big focus, and we now have a big operation in Europe, with an office presence and a big team down there, and it's growing very quickly. I would say Europe is about two years behind the US; that's usually how the growth pattern follows what's happening here. We're looking at China; we don't have a big presence in China right now. Japan: we have a big presence in Japan, and Japan is growing very quickly. And obviously Canada, along with the US, is growing very quickly as well. >>Great to have you on theCUBE again, for me personally and for Dave. And I want to say thanks to Cloudera for some great support over the years; you guys have been fantastic and have built a great company. It's so hard to build a company, and you've done a great job. I've got to ask you the final question, because you gave us that first sound bite, "I saw the future," back when you were just in your B round in the Palo Alto office, just starting to ramp. What's next? What's around the corner? Obviously we're on a trajectory right now: POSIX compliance and a lot of other things are going to fill in, the platform's going to get stronger, and we think open source will win through all the democratization of open source. What are you watching personally that's interesting to you about where this will take us? >>What's next is having this future vision we talked about become true: having a single platform that can store all of your data and, regardless of the type of that data, allow you to extract value with different types of workloads, whether that's batch, interactive, machine learning, or search; and there will be more workloads coming to the platform. It's bringing your applications, all of your data applications, to all of your data, as opposed to having the data go to them. >>And what are the landmines out there that the industry and community need to avoid to make that a reality? >>The key landmine is a bit technical: making sure that the vision continues to evolve and that we have a proper multi-workload resource management system that lets me run all of these types of workloads without having them step on each other's toes. That's the key step going forward. >>And of course playing well together in the sandbox. As always, competition is good, and Hadoop is doing great. Amr Awadallah, co-founder of Cloudera, inside theCUBE. This is SiliconANGLE and Wikibon's exclusive coverage of Hadoop Summit here in Silicon Valley. Right back with our next guest after the short break.