Data Power Panel V3


 

(upbeat music)

>> The stampede to cloud and massive VC investments have led to the emergence of a new generation of object store based data lakes, and with them two important trends, actually three important trends. First, a new category that combines data lakes and data warehouses, aka the lakehouse, has emerged as a leading contender to be the data platform of the future. And this novelty touts the ability to address data engineering, data science, and data warehouse workloads on a single shared data platform. The other major trend we've seen is that query engines and broader data fabric virtualization platforms have embraced next-gen data lakes as platforms for SQL-centric business intelligence workloads, reducing, or some even claim eliminating, the need for separate data warehouses. Pretty bold. However, cloud data warehouses have added complementary technologies to bridge the gaps with lakehouses. And the third is that many, if not most, customers embracing the so-called data fabric or data mesh architectures are looking at data lakes as a fundamental component of their strategies, and they're trying to evolve them to be more capable, hence the interest in lakehouse. But at the same time, they don't want to, or can't, abandon their data warehouse estate. As such, we see a battle royale brewing between cloud data warehouses and cloud lakehouses. Is it possible to do it all with one cloud-centric analytical data platform? Well, we're going to find out. My name is Dave Vellante, and welcome to the Data Platforms Power Panel on theCUBE, our next episode in a series where we gather some of the industry's top analysts to talk about one of our favorite topics, data. In today's session, we'll discuss trends, emerging options, and the trade-offs of various approaches, and we'll name names. Joining us today are Sanjeev Mohan, who's the principal at SanjMo; Tony Baer, principal at dbInsight; and Doug Henschen, vice president and principal analyst at Constellation Research. Guys, welcome back to theCUBE. Great to see you again.

>> Thank you. Thank you.

>> Thank you.

>> So it's early June and we're gearing up for two major conferences. There are several database conferences, but two in particular that we're very interested in: Snowflake Summit and Databricks Data and AI Summit. Doug, let's start off with you, and then Tony and Sanjeev, if you could kindly weigh in. Where did this all start, Doug, the notion of lakehouse? And let's talk about what exactly we mean by lakehouse. Go ahead.

>> Yeah, well, you nailed it in your intro. One platform to address BI, data science, data engineering; fewer platforms, less cost, less complexity, very compelling. You can credit Databricks for coining the term lakehouse back in 2020, but it's really a much older idea. You can go back to Cloudera introducing their Impala database in 2012. That was a database on top of Hadoop. And indeed, by the middle of the last decade, there were several SQL-on-Hadoop products and open standards like Apache Drill. At the same time, the database vendors were trying to respond to this interest in machine learning and data science, so the likes of Vertica were adding SQL extensions to support data science. But then later in that decade, with the shift to cloud and object storage, you saw the vendors shift to this whole cloud and object storage idea. So in the database camp, you have Snowflake introducing Snowpark to try to address the data science needs.
They introduced that in 2020, and last year they announced support for Python. You also had Oracle and SAP jump on this lakehouse idea last year, supporting both the lake and the warehouse from a single vendor, though not necessarily quite a single platform. Google very recently also jumped on the bandwagon. And then you also mentioned the SQL engine camp, the Dremios, the Ahanas, the Starbursts, really doing two things: a fabric for distributed access to many data sources, but also very firmly planting that idea that you can just have the lake and we'll help you do the BI workloads on that. And then of course the data lake camp, with the Databricks and Clouderas providing warehouse-style deployments on top of their lake platforms.

>> Okay, thanks, Doug. I'd be remiss not to note, for those of you who know me, that I typically write my own intros. This time my colleagues fed me a lot of that material, so thank you. You guys make it easy. But Tony, give us your thoughts on this intro.

>> Right. Well, I very much agree with both of you, which may not make for the most exciting television, in that it has been an evolution, just like Doug said. For instance, just to give an example, when Teradata bought Aster Data, it was initially seen as a hardware platform play. In the end, it was all those Aster functions that made a lot of big data analytics accessible to SQL. (clears throat) And so what I really see, just as a simpler or functional definition, is that the data lakehouse is really an attempt by the data lake folks to make the data lake friendlier territory to the SQL folks, and also to get into friendlier territory with all the data stewards, who are basically concerned about the sprawl and the lack of control and governance in the data lake. So it's really kind of a continuation of an ongoing trend. That being said, there's no action without counteraction, and of course, at the other end of the spectrum, we also see a lot of the data warehouses starting to add things like in-database machine learning. So they're certainly not surrendering without a fight. Again, as Doug was mentioning, this has been part of a continual blending of platforms that we've seen over the years, that we first saw in the Hadoop years with SQL on Hadoop and data warehouses starting to reach out to cloud storage, or I should say HDFS, and then with the cloud going cloud native and therefore trying to break the silos down even further.

>> Now, thank you. And Sanjeev, data lakes, when we first heard about them, it was such a compelling name, and then we realized all the problems associated with them. So pick it up from there. What would you add to Doug and Tony?

>> I would say these are excellent points that Doug and Tony have brought to light. The concept of lakehouse was going on, to your point, Dave, a long time ago, long before the term was invented. For example, Uber was trying to do a mix of Hadoop and Vertica, because what they really needed were transactional capabilities that Hadoop did not have. So they weren't calling it the lakehouse; they were using multiple technologies. But now they're able to collapse it into a single data store that we call lakehouse. Data lakes are excellent at batch processing large volumes of data, but they don't have the real-time capabilities such as change data capture, doing inserts and updates. So this is why lakehouse has become so important: because it gives us these transactional capabilities.
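To make that concrete, here is a minimal sketch of the kind of transactional operation Sanjeev is describing, an ACID upsert directly against open-format files in object storage. It assumes Spark with the open source Delta Lake package (delta-spark) installed; the bucket path, table, and column names are purely illustrative.

```python
# Hedged sketch: an upsert (MERGE) on a Delta Lake table in object
# storage. Assumes `pip install delta-spark` and a Spark runtime with
# the Delta jars available; the s3 path and schema are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lakehouse-upsert")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

updates = spark.createDataFrame(
    [(1, "alice@new.example"), (42, "bob@new.example")],
    ["user_id", "email"])

# The ACID merge: the insert/update capability that classic file-based
# data lakes lacked.
(DeltaTable.forPath(spark, "s3://my-bucket/users")   # hypothetical path
 .alias("t")
 .merge(updates.alias("u"), "t.user_id = u.user_id")
 .whenMatchedUpdate(set={"email": "u.email"})
 .whenNotMatchedInsertAll()
 .execute())
```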
>> Great. So I'm interested. The name is great, lakehouse, and the concept is powerful, but I get concerned that there's a lot of marketing hype behind it. So I want to examine that a bit deeper. How mature is the concept of lakehouse? Are there practical examples that really exist in the real world that are driving business results for practitioners? Tony, maybe you could kick that off.

>> Well, put it this way. I think what's interesting is that both data lakes and data warehouses have each had to extend themselves. To believe the Databricks hype, this was just a natural extension of the data lake. In point of fact, Databricks had to go outside its core technology of Spark to make the lakehouse possible, and it's a very similar type of thing on the part of the data warehouse folks, in that they've had to go beyond SQL. In the case of Databricks, there have been a number of incremental improvements to Delta Lake to basically make the table format more performant, for instance. But the other thing, I think the most dramatic change of all, is in their SQL engine: they had to essentially abandon Spark SQL, because in and of itself Spark SQL is essentially a stopgap solution, and if they wanted to really address that crowd, they had to totally reinvent SQL, or at least their SQL engine. And so Databricks SQL is not Spark SQL; it is not Spark. It's basically SQL adapted to run in a Spark environment, but the underlying engine is C++; it's not Scala or anything like that. So Databricks had to take a major detour outside of its core platform to do this. So to answer your question, this is not mature, because even though the idea of blending platforms has been going on for well over a decade, I would say that the current iteration is still fairly immature. And in the cloud, I could see a further evolution of this, because if you think through cloud native architecture, where you're essentially abstracting compute from data, there is no reason why, if you are dealing with the same data targets, say cloud object storage, you might not apportion the task to different compute engines. And so, for instance, let's say you're Google: you could have BigQuery perform the types of SQL analytics that would be associated with the data warehouse, and you could have BigQuery ML do some in-database machine learning, but at the same time, for another part of the query, which might involve, let's say, some deep learning, you might go out to the serverless Spark service or Dataproc. And there's no reason why Google could not blend all those into a coherent offering that's basically all triggered through microservices. I just gave Google as an example; you could generalize that to all the other cloud or third-party vendors. So I think we're still very early in the game in terms of maturity of data lakehouses.

>> Thanks, Tony. So Sanjeev, is this all hype? What are your thoughts?

>> It's not hype, but I completely agree, it's not mature yet. Lakehouses still have a lot of work to do. What I'm now starting to see is that the world is dividing into two camps. On one hand, there are people who don't want to deal with the operational aspects of vast amounts of data.
They are the ones who are going for BigQuery, Redshift, Snowflake, Synapse, and so on, because they want the platform to handle all the data modeling, access control, and performance enhancements. But there are trade-offs: if you go with these platforms, then you are giving up on vendor neutrality. On the other side are those who have engineering skills. They want independence; in other words, they don't want vendor lock-in. They want to transform their data into any number of use cases, especially data science and machine learning use cases. What they want is agility via open file formats, using any compute engine. So why do I say lakehouses are not mature? Well, cloud data warehouses provide you an excellent user experience. That is the main reason why Snowflake took off. If you have thousands of tables, it takes minutes to get them uploaded into your warehouse and to start experimenting. Table formats are resonating far more with the community than file formats. But once the cost of the cloud data warehouse goes up, the organization starts exploring lakehouses. The problem is lakehouses still need to do a lot of work on metadata. Apache Hive was a fantastic first attempt at it. Even today Apache Hive is still very strong, but it's all technical metadata and it has so many different restrictions. That's why we see Databricks investing in something called Unity Catalog. Hopefully we'll hear more about Unity Catalog at the end of the month. But there's a second problem I just want to mention, and that is lack of standards. All these open source vendors are running what I call ego projects. You see them on LinkedIn constantly battling with each other, but the end user doesn't care. The end user wants the problem solved. They want to use Trino, Dremio, Spark from EMR, Databricks, Ahana, Dask, Flink, Athena. But the problem is that we don't have common standards.
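A quick illustration of the "open file formats, any compute engine" agility Sanjeev describes: one plain Parquet file on disk, read by two unrelated engines with no load step and no proprietary format. The file and column names are made up for the example.

```python
# Hedged sketch: the same open-format file queried by two different
# engines. Assumes `pip install pyarrow pandas duckdb`; names are
# illustrative.
import duckdb
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Engine-neutral storage: plain Parquet on disk (or object storage).
pq.write_table(
    pa.table({"ride_id": [1, 2, 3], "fare": [12.5, 8.0, 23.1]}),
    "rides.parquet")

# Engine 1: pandas, for data-science-style work.
df = pd.read_parquet("rides.parquet")
print(df["fare"].mean())

# Engine 2: DuckDB, for SQL over the very same file. No copy, no load.
print(duckdb.sql(
    "SELECT count(*) AS trips, avg(fare) AS avg_fare "
    "FROM 'rides.parquet'").fetchall())
```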
>> Right, thanks. So Doug, I worry sometimes. I mean, I look at the space; we've debated for years best of breed versus the full suite. You see AWS with whatever, 12-plus different data stores and different APIs and primitives. You've got Oracle putting everything into its database; it's actually done some interesting things with MySQL HeatWave, so maybe there are proof points there. Snowflake, really good at data warehouse, at simplifying the data warehouse. Databricks, really good at making lakehouses actually more functional. Can one platform do it all?

>> Well, in a word, no. You can't be best of breed at all things. I think the upshot of the cogent analysis from Sanjeev there is that the vendors coming out of the database tradition excel at SQL. They're extending it into data science, but when it comes to unstructured data and data science, ML and AI, it's often a compromise. The data lake crowd, the Databricks and such, have struggled to completely displace the data warehouse when it really gets to the tough SLAs; they acknowledge that there's still a role for the warehouse. Maybe you can size down the warehouse and offload some of the BI workloads, and some of these SQL engines are good for ad hoc work and minimizing data movement. But really, when you get to the deep service level requirements, the high concurrency, the high query workloads, you end up creating something that's warehouse-like.

>> Where do you guys think this market is headed? What's going to take hold? Which projects are going to fade away? You've got some things in Apache projects like Hudi and Iceberg; where do they fit, Sanjeev? Do you have any thoughts on that?

>> So thank you, Dave. I feel that table formats are starting to mature. There is a lot of work being done. We will not have a single product or single platform; we'll have a mixture. I see a lot of Apache Iceberg in the news. Apache Iceberg is really innovating; their focus is on the table format. But then Delta and Apache Hudi are doing a lot of deep engineering work. For example, how do you handle high concurrency when there are multiple writes going on? Do you version your Parquet files, or how do you do your upserts, basically? So there are different focuses. At the end of the day, the end user will decide what is the right platform, but we are going to have multiple formats living with us for a long time.

>> Doug, is Iceberg, in your view, something that's going to address some of those gaps in standards that Sanjeev was talking about earlier?

>> Yeah, Delta Lake, Hudi, Iceberg, they all address this need for consistency and scalability. Delta Lake is technically open, but as for open access, I don't hear about Delta Lake anywhere but Databricks; I'm hearing a lot of buzz about Apache Iceberg. End users want an open performance standard, and most recently Google embraced Iceberg for its recent BigLake, its stab at supporting both lakes and warehouses on one conjoined platform.

>> And Tony, of course, you remember the early days of the sort of big data movement: you had MapR, the most closed; you had Hortonworks, the most open; you had Cloudera in between. There was always this kind of contest as to who's the most open. Does that matter? Are we going to see a repeat of that here?

>> I think it's spheres of influence, and Doug very much was kind of referring to this. I would call it kind of like the MongoDB syndrome, and I'm talking about MongoDB before they changed their license: an open source project, but very much associated with MongoDB, which pretty much controlled most of the contributions and made the decisions. And I think Databricks has the same ironclad hold on Delta Lake; the market still pretty much associates Delta Lake with Databricks as the Databricks open source project. I mean, Iceberg is probably further advanced than Hudi in terms of mindshare. And so what I see that breaking down to is essentially the Databricks open source versus the everything-else open source, the community open source. So I see a very similar type of breakdown repeating itself here.

>> So by the way, Mongo has a conference next week, another data platform that's kind of not really relevant to this discussion, but in a sense it is, because there's been a lot of discussion on earnings calls these last couple of weeks about consumption and who's exposed. Obviously people are concerned about Snowflake's consumption model; Mongo is maybe less exposed because Atlas is prominent in the portfolio, blah, blah, blah. But I wanted to bring up the little bit of controversy that we saw come out of the Snowflake earnings call, where the Evercore analyst asked Frank Slootman about discretionary spend. And Frank basically said, look, we're not discretionary, we are deeply operationalized, whereas he kind of poo-pooed the lakehouse or the data lake, et cetera, saying, oh yeah, data scientists will pull files out and play with them, that's really not our business. Do any of you have comments on that? Help us squint through that controversy. Who wants to take that one?

>> Let's put it this way.
The SQL folks are from Venus and the data scientists are from Mars. So it really comes down to that type of perception. The fact is that traditionally, analytics was very SQL oriented, and the quants were kind of off in their corner, where they were using SAS or they were using Teradata. It's really been a great leveler today, in that basic Python has become arguably one of the most popular programming languages, depending on what month you're looking at the TIOBE index. And of course, obviously SQL is, as I tell the MongoDB folks, SQL is not going away. You have a large skills base out there. And so basically I see this breaking down to: each group is going to have its own natural preferences for its home turf, and the fact that, let's say, the Python and Scala folks are using Databricks does not make them any less operational or mission critical than the SQL folks.

>> Anybody else want to chime in on that one?

>> Yeah, I totally agree with that. Python support in Snowflake is very nascent, with all of Snowpark, all of the things outside of SQL; they're very much relying on partners to make things possible, to make data science possible. And it's very early days. I think the bottom line, what we're going to see, is that each of these camps is going to keep working on doing better at the thing they don't do today, or they're new to, but they're not going to nail it. They're not going to be best of breed on both sides. So the SQL-centric companies and shops are going to do more data science on their database-centric platform. The data-science-driven companies might be doing more BI on their lakes with those vendors. And the companies that have highly distributed data are going to add fabrics, and maybe offload more of their BI onto those engines, like Dremio and Starburst.

>> So I've asked you this before, but I'll ask you, Sanjeev, 'cause Snowflake and Databricks are such great examples: you have the data engineering crowd trying to go into data warehousing and you have the data warehousing guys trying to go into the lake territory. Snowflake has $5 billion on the balance sheet, and I've asked you before, I'll ask you again: doesn't there have to be a semantic layer between these two worlds? Does Snowflake go out and do M&A and maybe buy an AtScale or a Datameer? Or is that just sort of a bandaid? What are your thoughts on that, Sanjeev?

>> I think the semantic layer is the metadata. The business metadata is extremely important. At the end of the day, the business folks would rather go to the business metadata than have to figure out, for example, let's say I want to update somebody's email address, and we have a lot of overhead with data residency laws and all that. I want my platform to give me the business metadata so I can write my business logic without having to worry about which database, which location. So having that semantic layer is extremely important. In fact, now we are taking it to the next level. Now we are saying that it's not just a semantic layer; it's all my KPIs, all my calculations. So how can I make those calculations independent of the compute engine, independent of the BI tool, and make them fungible? So there's more disaggregation of the stack, but it gives us more best of breed products that the customers have to worry about.
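An entirely hypothetical sketch of what Sanjeev's "define the KPI once, independent of engine and BI tool" idea can look like: a metric declared as data, then rendered into SQL for whichever engine runs it. Real semantic-layer products (AtScale, dbt's metrics layer, and others) go far beyond this toy templating, and every name below is made up.

```python
# Hedged sketch of a metrics/semantic layer: KPIs live in one
# engine-neutral definition, not inside any one warehouse or BI tool.
METRICS = {
    "active_users": {
        "expr": "count(distinct user_id)",
        "table": "events",
        "filter": "event_type = 'login'",
    },
}

def render_sql(metric: str) -> str:
    """Render a metric definition into SQL for whatever engine runs it."""
    m = METRICS[metric]
    # Dialect-specific tweaks (quoting, function names) would hook in
    # here per target engine.
    return (f"SELECT {m['expr']} AS {metric} "
            f"FROM {m['table']} WHERE {m['filter']}")

print(render_sql("active_users"))
# SELECT count(distinct user_id) AS active_users
#   FROM events WHERE event_type = 'login'
```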
>> So I want to ask you about the stack, the modern data stack, if you will. We always talk about injecting machine intelligence, AI, into applications, making them more data driven. But when you look at the application development stack, it's separate; the database tends to be separate from the data and analytics stack. Do those two worlds have to come together in the modern data world? And what does that look like organizationally?

>> So organizationally, and even technically, I think it is starting to happen. Microservices architecture was a first attempt to bring the application and the data world together, but they are fundamentally different things. For example, if an application crashes, that's horrible, but Kubernetes will self-heal and bring the application back up. But if a database crashes and corrupts your data, we have a huge problem. So that's why they have traditionally been two different stacks. They are starting to come together, especially with DataOps, for instance versioning of the way we write business logic. It used to be that business logic was highly embedded in our database of choice, but now we are disaggregating that using GitHub, CI/CD, the whole DevOps toolchain. So data is catching up to the way applications are built.

>> We also have translytical databases, databases that handle both transactions and analytics; that's a little bit of what the story is with MongoDB next week, adding more analytical capabilities. But I think companies that talk about that are always careful to couch it as operational analytics, not warehouse-level workloads. So we're making progress, but I think there's always going to be, or there will long be, a separate analytical data platform.

>> Until data mesh takes over. (all laughing) Not opening a can of worms.

>> Well, but wait, I know it's out of scope here, but wouldn't data mesh say, hey, take your best of breed, to Doug's earlier point? You can't be best of breed at everything. Wouldn't data mesh advocate: data lakes, do your data lake thing; data warehouse, do your data warehouse thing; then you're just a node on the mesh. (Tony laughs) Now you need separate data stores and you need separate teams.

>> To my point.

>> I think, I mean, put it this way. (laughs) Data mesh itself is a logical view of the world; the data mesh is not necessarily on the lake or on the warehouse. For me, the fear there is more in terms of the silos of governance that could happen, and the siloed views of the world, and how we redefine them. And that's why I want to go back to something Sanjeev said, which is that this raises the importance of the semantic layer. Now, that opens a couple of Pandora's boxes here. One, does Snowflake dare go into that space, or do they risk alienating their partner ecosystem, which is a key part of their whole appeal, which is best of breed? They're kind of in the same situation Informatica was in the early 2000s, when Informatica briefly flirted with analytic applications, realized that was not a good idea, and needed to double down on their core, which was data integration. The other thing, though, that raises the importance of, and this is where the best of breed comes in, is the data fabric. My contention is, whether you employ data mesh practices or not: if you do employ data mesh, you need data fabric. If you deploy data fabric, you don't necessarily need to practice data mesh.
But data fabric at its core, and admittedly it's a category that's still very poorly defined and evolving, but at its core, we're talking about a common metadata backplane, something that we used to talk about with master data management. This would be something more, what I would say, mutable, more evolving, using, let's say, machine learning so that we don't have to predefine rules or predefine what the world looks like. So I think in the long run, what this really means is that whichever way we implement, on whichever physical platform we implement, we all need to be speaking the same metadata language. And I think at the end of the day, regardless of whether it's a lake, warehouse, or lakehouse, we need common metadata.

>> Doug, can I come back to something you pointed out, those talking about bringing analytic and transaction databases together? You had talked about operationalizing those and the caution there. Educate me on MySQL HeatWave. I was surprised when Oracle put so much effort into that, and you may or may not be familiar with it, but a lot of folks have talked about it. Now, it's gotten nowhere in the market, no market share, but we've seen a lot of these benchmarks from Oracle. How real is that bringing together of those two worlds and eliminating ETL?

>> Yeah, I have to defer on that one. That's my colleague, Holger Mueller. He wrote the report on that. He's way deep on it, and I'm not going to mock him.

>> I wonder how real that is, or if it's just Oracle marketing. Anybody have any thoughts on that?

>> I'm pretty familiar with HeatWave. It's essentially Oracle doing, I mean, there's kind of a parallel with what Google's doing with AlloyDB: an operational database that will have some embedded analytics. And it's also something I expect to start seeing with MongoDB. I think, as Doug and Sanjeev were kind of referring to before, it's basically the operational analytics that are embedded within an operational database. The idea here is that the last thing you want to do with an operational database is slow it down, so you're not going to be doing very complex deep learning or anything like that, but you might be doing things like classification, you might be doing some predictives. In other words: we've just concluded a transaction with this customer, but was it less than what we were expecting? What does that mean? Is this customer likely to churn? I think we're going to be seeing a lot of that, and I think that's what a lot of MySQL HeatWave is all about. Whether Oracle has any presence in the market, well, it's still a pretty new announcement. But the other thing that kind of goes against Oracle, (laughs) that they have to battle against, is that even though they own MySQL and run the open source project, the actual commercial implementations are associated with everybody else. And the popular perception has been that MySQL has been basically a sidelight for Oracle. So it's on Oracle's shoulders to prove that they're damn serious about it.

>> It's no coincidence that MariaDB was launched the day that Oracle acquired Sun. Sanjeev, I wonder if we could come back to a topic that we discussed earlier, which is this notion of consumption. Obviously Wall Street's very concerned about it. Snowflake dropped prices last week.
I've always felt like, hey, the consumption model is the right model. I can dial it down when I need to; of course, the Street freaks out. What are your thoughts on pricing, the consumption model? What's the right model for companies, for customers?

>> The consumption model is here to stay. What I would like to see, and I think it's an ideal situation that actually plays into the lakehouse concept, is that I have my data in some open format, maybe it's Parquet or CSV or JSON or Avro, and I can bring whatever engine is the best engine for my workloads, bring it on, pay for consumption, and then shut it down. And by the way, that could be Cloudera. We don't talk about Cloudera very much, but it could be that one business unit wants to use Athena, and another business unit wants to use something else, Trino let's say, or Dremio. So every business unit is working on the same data set, see, that's critical, but that data set is maybe in their VPC, and they bring any compute engine, pay for the use, shut it down. Then you're getting value and you're only paying for consumption. It can't be, oops, I left a cluster running by mistake, so there have to be guardrails. The reason FinOps is so big is because it's very easy for me to run a Cartesian join in the cloud and get a $10,000 bill.

>> The consumption model looks like it's been a sort of victim of its own success in some ways. They made it so easy to spin up single-node instances, multi-node instances. Back in the day, when compute was scarce and costly, those database engines optimized every last bit so they could get as much workload as possible out of every instance. Today, it's really easy to spin up a new node, a new multi-node cluster, so that freedom has meant many more nodes that aren't necessarily getting that utilization. So Snowflake has been doing a lot to add reporting, monitoring, and dashboards around the utilization of all the nodes and multi-node instances that have been spun up. And meanwhile, we're seeing some of the traditional on-prem databases that are moving into the cloud trying to offer that freedom, and I think they're going to have that same discovery, that the cost surprises are going to follow as they make it easy to spin up new instances.
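Some back-of-the-envelope arithmetic behind Sanjeev's Cartesian join warning: a cross join multiplies row counts, and under consumption pricing you pay for every byte it touches. All numbers below, including the per-terabyte rate, are illustrative rather than any vendor's actual pricing.

```python
# Hedged sketch: why an accidental Cartesian join gets expensive under
# consumption pricing. Figures are illustrative, not real rates.
rows_a, rows_b = 2_000_000, 500_000   # two innocent-looking tables
bytes_per_row = 200

rows_out = rows_a * rows_b                    # 1e12 rows from a cross join
tb_touched = rows_out * bytes_per_row / 1e12  # ~200 TB materialized
cost = tb_touched * 5.0                       # assume ~$5 per TB scanned

print(f"{rows_out:,} rows, ~{tb_touched:,.0f} TB, ~${cost:,.0f}")

# The FinOps guardrail idea: block the query when the planner's
# estimate crosses a budget, instead of discovering it on the invoice.
BUDGET_TB = 1.0
if tb_touched > BUDGET_TB:
    print("query would be blocked by a cost guardrail")
```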
>> Yeah, a lot of money went into this market over the last decade, separating compute from storage, moving to the cloud. I'm glad you mentioned Cloudera, Sanjeev, 'cause they got it all started, the kind of big data movement. We don't talk about them that much. Sometimes I wonder if it's because when they merged Hortonworks and Cloudera, they dead-ended both platforms, but then they did invest in a more modern platform. But what's the future of Cloudera? What are you seeing out there?

>> Cloudera has a good product. I have to say, the problem in our space is that there are way too many companies and way too much noise. We are expecting the end users to parse it out, or we're expecting analyst firms to boil it down, so I think marketing becomes a big problem. As far as technology is concerned, I think Cloudera did turn themselves around, and Tony, I know you talk to them quite frequently. I think they have had quite a comprehensive offering for a long time, actually. They created Kudu, so they've got operational capabilities; they have Hadoop; they have an operational data warehouse; they've migrated to the cloud. They are in hybrid multi-cloud environments. A lot of cloud data warehouses are not hybrid; they're only in the cloud.

>> Right.
I think where Cloudera has been most successful has been in the transition to the cloud, and the fact that they're giving their customers more on-ramps to it, more hybrid on-ramps. So I give them a lot of credit there. They also have been trying to position themselves as being the most price friendly, in terms of, we will put more guardrails and governors on it. I mean, part of that could be spin, but on the other hand, they don't have the same vested interest in compute cycles as, say, AWS would have with EMR. That being said, I think Cloudera's most powerful appeal, and it almost sounds in a way, I don't want to cast them as a legacy system, but the fact is they do have a huge landed legacy on-prem and still significant potential to land and expand that to the cloud. That being said, even though Cloudera is multifunction, it certainly has its strengths and weaknesses. The fact is, yes, Cloudera has an operational database, or an operational data store, kind of the outgrowth of HBase, but Cloudera is still primarily known for the deep analytics. Nobody's going to buy Cloudera, or Cloudera Data Platform, strictly for the operational database. They may use it as an add-on, in the same way that a lot of customers have used, let's say, Teradata to do some machine learning or, let's say, Snowflake to parse through JSON. Again, it's not an indictment or anything like that, but obviously they have their strengths and their weaknesses. I think their greatest opportunity is with their existing base, because that base has a lot invested and vested. And the fact is they do have a hybrid path that a lot of the others lack.

>> And of course, being on the quarterly shot clock was not a good place to be under the microscope for Cloudera, and now they at least can refactor the business accordingly. I'm glad you mentioned hybrid, too. We saw Snowflake last month do a deal with Dell whereby non-native Snowflake data could access on-prem object storage from Dell. They announced a similar thing with Pure Storage. What do you guys make of that? How significant will that be? Will customers actually do that? I think they're using either materialized views or external tables.

>> There are data regulations and residency requirements. There are desires to have these platforms in your own data center. And finally they capitulated. I mean, Frank Slootman is famous for being very focused, and earlier, not many months ago, they called going on-prem a distraction, but clearly there's enough demand, and certainly with government contracts, any company that has data residency requirements, it's a real need. So they finally addressed it.

>> Yeah, I'll bet dollars to donuts there was an EBC session and some big customer said, if you don't do this, we ain't doing business with you. And that was like, okay, we'll do it.

>> So Dave, I have to say, earlier on you had brought up this point about how Frank Slootman was poo-pooing data science workloads. On your show, about a year or so ago, he said, we are never going on-prem. He burnt that bridge. (Tony laughs) That was on your show.

>> I remember the statement exactly, because it was interesting. He said, we're never going to do the halfway house. And I think what he meant is, we're not going to bring the Snowflake architecture to run on-prem, because it defeats the elasticity of the cloud. So this was kind of a capitulation in a way.
But I think it still preserves his original intent, sort of. I don't know.

>> The point here is that every vendor will poo-poo whatever they don't have until they do have it.

>> Yes.

>> And then it'll be like, oh, we are all in, we've always been doing this, we have always supported this, and now we are doing it better than others.

>> Look, it was the same type of shock wave that we felt when AWS, at the last moment at one of their re:Invents, said, oh, by the way, we're going to introduce Outposts. And the analyst group is typically pre-briefed about a week or two ahead under NDA, and that was not part of it; they just casually dropped that in the analyst session. You could have heard the sound of lots of analysts changing their diapers at that point.

>> (laughs) I remember that. And props to Andy Jassy, who once, many times actually, told us: never say never when it comes to AWS. So guys, I know we've got to run, we've got some hard stops. Maybe you could each give us your final thoughts. Doug, start us off, and then--

>> Sure. Well, we've got the Snowflake Summit coming up. I'll be looking for customers that are really doing data science, that are really employing Python through Snowflake, through Snowpark. And then a couple of weeks later, we've got Databricks with their Data and AI Summit in San Francisco. I'll be looking for customers that are really doing considerable BI workloads. Last year I did a market overview of this analytical data platform space: 14 vendors, eight of them claiming to support lakehouse, from both sides of the camp. The top Databricks customer that they could cite was unnamed; it had 32 concurrent users doing 15,000 queries per hour. That's good, but it's not up to the most demanding BI SQL workloads, and they acknowledged that and said they need to keep working on it. Snowflake, asked for their biggest data science customer, cited Kabura: 400 terabytes, 8,500 users, 400,000 data engineering jobs per day. I took the data engineering jobs to be probably SQL-centric, ETL-style transformation work. So I want to see the real use of Python, how much Snowpark has grown as a way to support data science.

>> Great. Tony.

>> Actually, of all things, and certainly I'll also be looking for similar things to what Doug is saying, but, kind of out of left field, I'm interested to see what MongoDB is going to start to say about operational analytics, 'cause they're into this conquer-the-world strategy: we can be all things to all people. Okay, if that's the case, what's it going to be with putting in some inline analytics? What are you going to be doing with your query engine? So that's actually kind of an interesting thing we're looking for next week.

>> Great. Sanjeev.

>> So I'll be at MongoDB World, Snowflake, and Databricks, and very interested in seeing, since Tony brought up MongoDB, that even the databases are shifting tremendously. They are addressing the HTAP use case, both online transactional and analytical. I'm also seeing that these databases started as, let's say in the case of MySQL HeatWave, relational, or in MongoDB's case, document, but now they've added graph, they've added time series, they've added geospatial, and they just keep adding more and more data structures, really making these databases multifunctional. So very interesting.
It gets back to our discussion of best of breed versus all-in-one. And it's likely Mongo's path, or part of their strategy of course, is through developers; they're very developer focused. So we'll be looking for that. And guys, I'll be there as well. I'm hoping that we maybe have some extra time on theCUBE, so please stop by and we can chat a little bit. Guys, as always, fantastic. Thank you so much, Doug, Tony, Sanjeev, and let's do this again.

>> It's been a pleasure.

>> All right, and thank you for watching. This is Dave Vellante for theCUBE and our excellent analysts. We'll see you next time. (upbeat music)

Published Date : Jun 2 2022



Steven Mih, Ahana & Girish Baliga, Uber | CUBE Conversation


 

(bright music)

>> Hey everyone, welcome to this CUBE Conversation featuring Ahana. I'm your host, Lisa Martin. I've got two guests here with me today. Steven Mih joins us, Presto Foundation governing board member and co-founder and CEO of Ahana, and Girish Baliga, Presto Foundation governing board chair and senior engineering manager at Uber. Guys, thanks for joining us.

>> Thanks for having us.

>> Thanks for having us.

>> So Steven, we're going to dig into and unpack Presto in the next few minutes or so, but let's go ahead and start with you. Talk to us about some of the challenges with the open data lakehouse market. What are some of those key challenges that organizations are facing?

>> Yeah, just pulling up the slide, you know, what we see is that many organizations are dealing with a lot more data and very different data types, and putting all of that into the data warehouse, which has traditionally been the workhorse for BI and analytics, becomes very, very expensive, and there's a lot of lock-in associated with that. And so what's happening is that people are putting the data, semistructured and unstructured data for example, in cloud data lakes or other data lakes, and they find that they can query it directly with a SQL query engine like Presto. That lets you have a much more open approach to getting insights out of your data. That's what this is all about, and that's why companies are moving to a modern architecture. Girish, maybe you can share some of your thoughts on how Uber uses Presto for this.

>> Yeah, at Uber we use Presto in our internal deployments. So at Uber we have our own data centers; we store data locally in our data centers, but we have made the conscious choice to go with an open data stack. Our entire data stack is built around open source technologies like Hadoop, Hive, Spark, and Presto. And so Presto is an invaluable engine that is able to connect to all these different storage systems and data formats and allow us to have a single entry point for our users to run their SQL queries and get insights rather quickly, compared to some of the other engines that we have at Uber.

>> So let's talk a little bit about Presto so that the audience gets a good overview of it. Steven, starting with you, you talked about the challenges of the traditional data warehouse. Talk to us about why Presto was founded, the open source project; give us that background information, if you will.

>> Absolutely. So Presto was originally developed out of the biggest hyperscaler out there, which is Facebook, now known as Meta. And they open sourced it and donated the project to the Linux Foundation. And so Presto is a SQL query engine that runs directly on open data lakes, so you can put your data into open formats like Parquet or ORC and get insights directly from that at a very good price-performance ratio. The Presto Foundation, which Girish and I are part of, is a consortium of companies all working together because we want to see Presto continue to get bigger and bigger. Kind of like Kubernetes has an organization called CNCF, Presto has the Presto Foundation, all under the umbrella of the Linux Foundation. And so there are a lot of exciting things coming on the roadmap that make Presto very unique.
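As a small illustration of what Steven just described, SQL directly over open-format files in a data lake with no load step, here is a hedged sketch using the presto-python-client package. It assumes a running Presto coordinator with a Hive catalog already pointed at the lake; the host, schema, and table names are made up.

```python
# Hedged sketch: querying Parquet/ORC files in a data lake through
# Presto. Assumes `pip install presto-python-client`; every name below
# is illustrative.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",    # catalog mapped to open-format files in S3/HDFS
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT rider_city, count(*) AS trips
    FROM trips_parquet          -- external table over lake storage
    GROUP BY rider_city
    ORDER BY trips DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```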
You know, RaptorX is a multilevel caching system that's been fantastic. Aria optimizations are another area. We at Ahana have developed some security features, donating the integrations with Apache Ranger, and that's the type of thing we do to help the community. But maybe Girish can talk about some of the exciting items on the roadmap that you're looking forward to.

>> Absolutely. I think from Uber's point of view, it's the sheer scale of data and our volume of query traffic. We run about half a million Presto queries a day, right? And we have thousands of machines in our Presto deployments. So at that scale, in addition to functionality, you really want a system that can handle traffic reliably, that can scale, and that is backed by a strong community, which guarantees that if you pull in a new version of Presto, you won't break anything, right? So all of those things are very important to us. That's where we are relying on our partners, particularly folks like Facebook and Twitter and Ahana, to build and maintain this ecosystem that gives us those guarantees. So that is on the reliability front, but on the roadmap side, we are also excited to see where Presto is extending. In addition to the projects that Steven talked about, we are also looking at things like Presto on Spark, right? So take the Presto SQL and run it as a Spark job, for instance. Or running Presto for real-time analytics applications, something that we built and contributed from the Uber side. So we are all taking it in very different directions; we all have different use cases to support, and that's the exciting thing about the foundation. It allows us all to work together to get Presto to be a bigger, better, and more flexible engine.

>> You guys mentioned Facebook, and I saw on the slide I think Twitter as well. Talk to me about some of the organizations that are leveraging the Presto engine and some of the business benefits. Steven, you talked about insights; obviously being able to get insights from data is critical for every business these days.

>> Yeah, a major, major use case is ad hoc and interactive queries, and being able to drive insights from doing so. As I mentioned, there's so much data being generated and stored, and being able to query that data in place, at very, very high performance, meaning you can get answers back in seconds, lets you have the interactive ability to drill into data and innovate your business. And so this is fantastic, because it's open source technology that's been developed at hyperscalers like Uber; you can pick it up, just download it right from prestodb.io, start to run with it, and join the community. I think from an open source perspective, this project under the governance of the Linux Foundation gives you the confidence that it's fully transparent and you'll never see any licensing changes, by the Linux Foundation charter. And therefore the technology remains free forever, without limitations occurring later on that would perhaps favor commercialization by any one vendor. That's not the case. So maybe Girish, your thoughts on how we've been able to attract industry giants to collaborate, to innovate further, and your thoughts on that.

>> Yeah, so one of the interesting things I've seen in this space is that there is a bifurcation of companies in this ecosystem.
So there are these large internet-scale companies like Facebook, and Uber, and Twitter, which basically want to use something like Presto for their internal use cases. And then there is a second set of companies, enterprise companies like Ahana, which want to take Presto and provide it as a service for other companies to use, as an alternative to things like Snowflake and other systems, right? And the foundation is a great place for both sets of companies to come together and work. The internet-scale companies bring in the scale, the reliability, the different kinds of ways in which you can challenge the system, optimize it, and so forth, and then companies like Ahana bring in the flexibility and the extensibility, so you can work with different clouds, different storage formats, different engines. And I think it's a great partnership that we can see happening, primarily through the foundation, which you would be hard pressed to find in a single vendor or, you know, a single-source system on the market today.

>> How long ago was the Presto Foundation initiated?

>> It's been over three years now, and it's been going strong. We're over a dozen members, and it's open to everyone. It's all governed like the Linux Foundation, so we use best practices from that, and you can just check it out at prestodb.io, where you can get the software or hear about how to join the foundation. It includes members like Intel and HPE as well, and we're really excited for new members to come, contribute, and participate.

>> Sounds like you've got good momentum there in the foundation. Steven, talk a little bit about the last two years. Have you seen an acceleration in use cases, in the number of users? We've been in such an interesting environment where the need for real-time insights is essential for every business, initially a couple of years ago to survive, but now to really thrive. Have you seen that acceleration with Presto in that timeframe?

>> Absolutely. We see the acceleration of being more data-driven, and especially moving to cloud and having more data in the cloud. Digital innovation is happening very fast, and Presto is a major enabler of that, again, being able to drive insights from the data. This is not just your typical business data; it's now getting into clickstream data, knowing how customers are operating today. Uber is a great example of all the different types of innovations they can drive, whether it be, you know, knowing in real time what's happening with rides, or offering you a subscription for special deals to use the service more. So, you know, at Ahana we really love Presto, and we provide a SaaS managed service of the open source, provide free trials, and help people get up to speed who may not have the same type of skills as Uber or Facebook. And we work with all companies in that way.

>> Think about consumers these days: we're very demanding, right? I think one of the things that was in short supply during the last two years was patience. If I think of Uber as a great example, if I'm asking for a ride I want to know, in real time, exactly what's coming for me. Where is it now? How many more minutes is it going to take? I mean, that need for real-time insights is critical across every industry. But have you seen anything in the last couple of years that's been more leading edge, like e-commerce or retail, for example?
I'm just curious.

>> Girish, you want to take that one?

>> Yeah, sure. I can speak from the Uber point of view. Real-time insights has really exploded as an area, particularly, as you mentioned, with this just-in-time economy, right? To talk about it a little bit from the Uber side, there are the insights that you mentioned, about when your ride is coming and things of that nature, but look at it from the driver's point of view, and, now that we have Uber Eats, look at it from the restaurant manager's point of view, right? They also want to know how their business is doing. How many customer orders are coming, for instance? What is the conversion rate? And so forth, right? And today these are all insights powered by a system that has Presto as a front-end interface at Uber. You have tens of thousands of these queries every single second, and the queries run in about a second, and so forth. So you are really talking about production systems running on top of Presto, production serving systems. Coming to other use cases like e-commerce, we have definitely seen some of that uptake happen as well. In the broader community, for instance, we have companies like Stripe and other folks who are also using this stack, which is very similar to ours, based on another open source technology called Pinot, using Presto as an interface. And so we are seeing this whole open data lakehouse move from just being about interactive analytics to driving all different kinds of analytics, anything to do with data and insights in this space.

>> Yeah, sounds like the evolution has been kind of on a rocket ship the last couple of years. Steven, one more time, we're out of time, but can you mention that URL where folks can go to learn more?

>> Yeah, prestodb.io, and that's the Presto Foundation. And you know, I just want to say that we'll be sharing the use case at the Startup Showcase coming up with theCUBE. We're excited about that, and we really welcome everyone to join the community. It's a real vibrant, expanding community, and we look forward to seeing you online.

>> Sounds great, guys. Thank you so much for sharing with us what the Presto Foundation is doing, all of the things that it is catalyzing. Great stuff. We look forward to hearing that customer use case. Thanks for your time.

>> Thank you.

>> Thanks, Lisa, thank you.

>> Thanks, everyone.

>> For Steven and Girish, I'm Lisa Martin. You're watching theCUBE, the leader in live tech coverage. (bright music)

Published Date : Mar 24 2022



Kumar Sreekanti, BlueData | CUBE Conversation, May 2018


 

(upbeat trumpet music)

>> From our studios in the heart of Silicon Valley, Palo Alto, California, this is a CUBE Conversation.

>> Welcome, everybody. I'm Dave Vellante, and we're here in our Palo Alto studios and we're going to talk about big data. For the last ten years, we've seen organizations come to the realization that data can be used to drive competitive advantage, and so they dramatically lowered the cost of collecting data. We certainly saw this with Hadoop. But you know what? Data is plentiful; insights aren't. The infrastructure around big data is very challenging. I'm here with Kumar Sreekanti, co-founder and CEO of BlueData, and a long-time friend of mine. Kumar, it's great to see you again. Thanks so much for coming to theCUBE.

>> Thank you, Dave, thank you. Good to see you as well.

>> We've had a number of conversations over the years, the Hadoop days, on theCUBE; you and I go way back. But as I said up front, big data sounded so alluring, but it's very, very complex to get started, and we're going to get into that. I want to talk about BlueData. You recently sold the company to HPE, congratulations.

>> Thank you, thank you.

>> It's fantastic. Go back: why did you start BlueData?

>> Prior to starting BlueData, I was at VMware, and I had a great opportunity to be in the driving seat, working with many talented individuals as well as with many customers and CIOs. I saw that while VMware solved the problem of single instances of virtual machines and transformed the data center, the new wave of distributed systems, the first example of which was Hadoop, was quite rigid. They were running on bare metal and they were not flexible. Customers were having a lot of issues, the ones that you just talked about. There's a new stack coming up every day. They're running on bare metal. I can't run production and DevOps on the same systems. Whereas the cloud was making progress, so we felt there was an opportunity to build a VMware-like platform that focuses on big data applications. This was back in 2013, right; that was the early genesis. We saw that data is here, and data is the new oil, as many people have said, and organizations have to figure out a way to harness its power, and they need an invisible infrastructure. They need very innovative platforms.

>> You know, it's funny. We see data as even more valuable than oil, because oil you can only use once. (Kumar laughs) You can use data many, many times.

>> That's a very good one.

>> Companies are beginning to realize that. So talk about the journey of big data. You're a product guy. You've built a lot of products, highly technical. You know a lot of people in the valley. You've built great teams. What was the journey like with BlueData?

>> You know, a lot of people would like it to be a straight line from the start to that point. (Dave laughs) It is not; it's fascinating. At the same time a stressful, up-and-down journey, but very fulfilling. A, this is probably one of the best products that I've built in my career. B, it actually solves a real problem for the customers, and in the process you find a lot of satisfaction, not only in building a great product, but in building value for the customers. The journey has been very good. We were blessed with extremely good advisors from the very beginning. We were really fortunate to have good investors, and, as you said, with my knowledge and familiarity in the valley, I was able to build a good team. Overall, an extremely good journey.
It's putting a bow on the top, as you pointed out, the exit, but it's a good journey. There's a lot of nuance I learned in the process. I'm happy to share as we go through. >> Let's double-click on the problem. We talked a little bit about it. You referenced it. Every day there's a new open source project coming out. There's Sqoop and Hive and a new open source database coming out. Practitioners are challenged. They don't have the skillsets. The Ubers and the Facebooks, they could probably figure it out and have the engineers to do it, but the average enterprise may not. Clearly complexity is the problem, but double-click on that and talk a little bit about, from your perspective, what that challenge is. >> That's a very good point. I think when we started the company, we noticed exactly that. There are companies that have the muscle to hire the set of engineers and solve the problem, vertically specific to their application or their use case, but the average, which is Fortune 500 companies, do not have that kind of engineering manpower. Then I also call this day-two operations. When you actually go back to VMware or Windows, as soon as you buy the piece of software, the next day it's operational and you know how to use it, but with these new stacks, by the time the stack is installed, you already have a newer version. It's actually solutions-led, meaning that you want to have a solution understanding, but you want to make the infrastructure invisible, meaning: I want to create a cluster or I want to funnel the data, but I don't want to think about those things. I just want to worry directly about what my solution is, and I want BlueData to worry about creating me a cluster and automating it. It's automation, automation, automation, orchestration, orchestration, orchestration. >> Okay, so that's the general way in which you solve this problem. Automate, you've got to take the humans out of the equation. Talk specifically about the BlueData architecture. What's the secret sauce behind it? >> We were very fortunate to see containers as the new lightweight virtual machines. We have taken an approach where certain applications, particularly stateful ones, need different handling than cloud-native, non-stateful applications. In fact, our architecture predates Kubernetes, so we built a bottom-up, pure white-paper architecture that is geared towards big data and AI/ML applications. Now, actually, even HPC is starting to move in that direction. >> Well, tell me actually, talk a little bit about that in terms of the evolution of the types of workloads that we've seen. You know, it all started out with Hadoop as batch, and then very quickly that changed. Talk about that spectrum. >> When we started, the highest ask from the customers was Hadoop and batch processing, but everybody knew that was just the beginning, and with streaming and the new streaming technologies it's near-realtime analytics, moving now to AI/ML applications like H2O and Caffe, and now I'm seeing the customers asking, I would like to have a single platform that actually runs all these applications for me. The way we built it, going back to your previous question, the architecture is, our goal is for you to be able to create these clusters and not worry about copying the data; a single copy of the data. We built a technology called DataTap, which we talked about in the past, and that allows you to have a single copy of the data with multiple applications able to access it.
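To make that single-copy idea concrete, here is a minimal sketch of what shared, in-place access can look like from a Spark job. The dtap:// scheme is the HDFS-compatible protocol BlueData uses for DataTap, but the storage name and path below are invented for illustration, so treat this as a sketch rather than a definitive recipe.

```python
# Illustrative PySpark job: different applications (ETL, BI, ML) can all
# point at the same shared dataset in place, instead of each keeping a copy.
# The dtap:// URI and path are assumptions for illustration; the actual
# connector setup comes from the platform's documentation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-copy-demo").getOrCreate()

# One copy of the data, read in place through the DataTap-style connector.
events = spark.read.parquet("dtap://TenantStorage/clickstream/2018/05/")

# A second workload (say, a reporting job) would read the very same URI
# rather than ingesting its own duplicate of the dataset.
events.groupBy("event_date").count().show()
```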
>> Now, HPC, you mentioned HPC. It used to be, maybe still is, this sort of crazy crowd. (laughter) You know, they do things differently, and it's all bandwidth, bandwidth, bandwidth and very high-end performance. How do you see that fitting in? Do you see that going mainstream? >> I'm glad you pointed that out, because I'm not saying everything is moving over, but I am starting to see it; in fact, I was in a conversation this morning with an HPC team and an HPC customer. They are seeing the value of the scale of distributed systems. HPC tends to be scale-up with a single high-bandwidth system. They are seeing the value of, how can I actually bring these two pieces together? I would say it's in its infancy. Look how long Hadoop took, 10 years, so it's probably going to take a longer time, but I can see enterprises thinking of a single unified platform that's probably driven by Kubernetes, with these applications instantiated, orchestrated, and automated on top of that. >> Now, how about the cloud? Where does that fit? We often say in theCUBE that it's not Moore's Law anymore. The innovation cocktail is data, all this data that we've collected, applying machine intelligence, and then scaling with the cloud. Obviously cloud is hugely important. It gobbled up the whole Hadoop business, but where do you see it fitting? >> Cloud is a big elephant in the room. We all have to acknowledge that. I think it provides significant advantages. I always used to say this, and I may have said this in my previous CUBE interviews: cloud is all about the innovation. The reason cloud got so much traction is because, if you compare the amount of innovation to on-prem, they were at least five years ahead. Even with the BlueData technology that we brought to bear, EMR on Amazon was out in front, but it was only available on Amazon. It's what we call an opinionated stack. That means you are forced to use what they give you, as opposed to, I want to bring my own piece of software. We see cloud as well as on-prem as pretty much homogeneous. In fact, BlueData software runs both on-prem and on the cloud, in a hybrid fashion. Same software, and you can bring your stack on top of BlueData. >> Okay, so hybrid was the next piece of it. >> What we see, at least from my exposure, is that cloud is very useful for certain applications. Especially, what I'm seeing is, if you are collecting large amounts of data in the cloud, I would rather run batch processing there to curate the data, then bring the most important data back on-prem and run some realtime work. That's just one example. I see a balance between the two. I also see a lot of organizations still collecting terabytes of data on-prem, and they're not going to take terabytes of data overnight to the cloud. We are seeing all the customers asking, we would like to see a hybrid solution. >> The reason I like the acquisition by HPE is because not only is it a company started by a friend and someone that I respect and who knows how to build solid technology that can last, but it's software. HPE, as a company, in my view needs more software content. (Kumar laughs) Software's eating the world, as Marc Andreessen says. It would be great to see that software live as an independent entity. I'm sure decisions are still being made, but how do you see that playing out? What are the initial discussions like? What can you share with us? >> That's very well put.
Currently, the goal from my boss and the teams there is that we want to keep the BlueData software independent. It runs on all x86 hardware platforms, and we want to drive the roadmap based on the customer needs on the software side, like running more HPC applications. Our roadmap will be driven by customer needs and the changes in the stack on top, not necessarily by the hardware. >> Well, that fits with HPE's culture of always trying to give optionality, and we've had this conversation many, many times with senior-level people like Antonio. It's very important that there's no lock-in, an open mindset, and certainly HPE lives up to that. Thanks so much for coming-- >> You're welcome. Back into theCUBE. >> I appreciate you having me here as well. >> Your career has been amazing as we go back a long time. Wow. From hardware, software, all these-- >> Great technologies. (laughter) >> Yeah, solving hard problems, and we look forward to tracking your career going forward. >> Thank you, thank you. Thanks so much. >> And thank you for watching, everybody. This is Dave Vellante from our Palo Alto Studios. We'll see ya next time. (upbeat trumpet music)

Published Date : Mar 22 2019

SUMMARY :

in the heart of Silicon Valley, Palo Alto, California. Kumar, it's great to see you again. Good to see you as well. the Hadoop days, on theCUBE, you and I go way back, Go back, why did you start BlueData? and the organizations have to figure out a way because you can only once. and so talk about the journey of big data. and in the process you actually find a lot and have the engineers to do it, There are companies that have the muscle Okay, so that's the general way as the new lightweight virtual machines. in terms of the evolution of the types of workloads in the past and that allows you to have a single copy and very high-end performance. They are seeing the value of the scale Now, how about the cloud? Even the BlueData technology that we brought to the barer, and curate the data and bring the very important amount What are the initial discussions like? and the change in the stack on the top, and certainly HPE lives up to that. You're welcome. Your career has been amazing as we go back a long time. (laughter) and we look forward to tracking your career going forward. Thanks so much. And thank you for watching, everybody.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Marc Andressen | PERSON | 0.99+
Dave Vellante | PERSON | 0.99+
2013 | DATE | 0.99+
Dave | PERSON | 0.99+
Kumar | PERSON | 0.99+
HPE | ORGANIZATION | 0.99+
BlueData | ORGANIZATION | 0.99+
Amazon | ORGANIZATION | 0.99+
10 years | QUANTITY | 0.99+
Kumar Sreekanti | PERSON | 0.99+
Ubers | ORGANIZATION | 0.99+
VMware | ORGANIZATION | 0.99+
Facebooks | ORGANIZATION | 0.99+
May 2018 | DATE | 0.99+
Palo Alto | LOCATION | 0.99+
two | QUANTITY | 0.99+
two pieces | QUANTITY | 0.99+
first | QUANTITY | 0.99+
one example | QUANTITY | 0.99+
one | QUANTITY | 0.98+
both | QUANTITY | 0.98+
single copy | QUANTITY | 0.97+
Antonio | PERSON | 0.97+
Hadoop | TITLE | 0.96+
single platform | QUANTITY | 0.96+
Windows | TITLE | 0.95+
single | QUANTITY | 0.95+
Kubernetes | TITLE | 0.94+
Silicon Valley | LOCATION | 0.94+
Palo Alto Studios | ORGANIZATION | 0.91+
Hadoop | ORGANIZATION | 0.91+
this morning | DATE | 0.9+
Vmware | ORGANIZATION | 0.85+
single instance | QUANTITY | 0.85+
Palo Alto, California | LOCATION | 0.84+
least five years | QUANTITY | 0.84+
CUBE | ORGANIZATION | 0.83+
double | QUANTITY | 0.81+
once | QUANTITY | 0.81+
BlueData | OTHER | 0.79+
H2O | TITLE | 0.79+
two operations | QUANTITY | 0.78+
Cafe | TITLE | 0.78+
Kubernetes | ORGANIZATION | 0.76+
500 | QUANTITY | 0.76+
HPE | TITLE | 0.73+
next day | DATE | 0.68+
Moore | PERSON | 0.68+
last ten years | DATE | 0.64+
DataTap | TITLE | 0.64+
The Scoop | TITLE | 0.59+
CUBEConversation | EVENT | 0.56+
BlueData | TITLE | 0.52+
The Hive | TITLE | 0.49+
theCUBE | ORGANIZATION | 0.42+

Josh Rogers, Syncsort | CUBEConversation, November 2018


 

>> From the SiliconANGLE media office in Boston, Massachusetts, it's theCUBE. Now, here's your host, Stu Miniman. >> Hi, I'm Stu Miniman and welcome to our Boston area studio. I'm happy to welcome back to the program a multi-time guest, Josh Rogers, who's the CEO of Syncsort. Josh, great to see ya. >> Great to see you. Thanks for having me. >> Alright so, Syncsort is a company that, I would say, you guys are deep in the data ocean. Data is at the center of everything. At Wikibon, when we did our predictions, everything, whether you're talking about cloud, whether you're talking about infrastructure, and of course everything like IoT and Edge, data is at the center of it. To help start off, there's this term, big iron, big data. Help explain to us what that is and what that means to both Syncsort and your customers. >> Sure yeah, so we like to talk about Syncsort as the leader in big iron to big data, and it's a positioning that we've chosen for the firm because we think it represents the value proposition that we bring to our customers, but we also think it represents a collection of use cases that are really at the top of the agenda of CIOs today. And really we talk about it in two areas. The first is a recognition that large enterprises still run mission critical workloads on systems that they've built over the last 20, 30, 40 years. Those systems leverage mainframe computing, they leverage IBM i or AS/400, and they've spent trillions of dollars building those systems, and they still deliver core workloads that power their businesses. So mission number one is that these firms want to make sure that they optimize those environments. They run them as efficiently as possible. They can't go down. They've got the proper security kind of protocols around them, and of course that situation's always changing as workloads grow and change in these environments. So first is, how do I optimize the systems that, while they may be mature, are still mission critical. The second is a recognition that most of the critical data assets for our customers are created in these systems. These are the systems that execute the transactions and as a result have core information around the results of the firm, the firm's customers, et cetera. So the second value proposition is, how do I maximize the value of the data that gets produced in those systems, which tends to be a focus on liberating it, making a copy of it and moving it into next generation analytic systems. And when you look at the technical requirements of that, it turns out that it's hard. I'm taking data from systems that were created 50 years ago and I'm integrating it with systems that were created five years ago. And so we've got a special set of expertise and solutions that allow customers to both optimize these old systems and maximize the value of the data produced in those systems. >> You bring up some really good points. I've been talking the last couple of years to people about how do I really wrap my arms around my data, and we're talking about a multi-cloud world where we have pockets of information trapped. That's a challenge. So it's not just about my data center and Amazon.
It's like, oh wait, I've got all these SaaS deployments, and I think it was probably a blind spot that I had: sure, you've got companies that have, let's call them legacy systems, ones where they've got a lot of investment, and these are mission critical. These are the ones that are not easy to modernize, but if I can get access to the data and put it into these next generation systems, it sounds like you kind of free that data and allow it to be leveraged much more easily. >> That's right, that's right, and what we try to do is focus on what the next generation trends in data are, and how they are going to intersect with these older systems. And so that started as big data, but it includes cloud and the multi-cloud. It includes real-time and IoT. It includes things like Blockchain. We're really scanning the horizon for these kinds of generational shifts in terms of how am I going to leverage data, and getting really tight on the use cases that our customers are going to need, so we can integrate those new technologies with these old investments. >> Josh, I'd love to hear what you're seeing from customers. So we've talked to you at some of the big data shows. I know we've spoken to you at the Splunk shows. I felt like we as an industry got bogged down in some of the tools for a couple of years. When Wikibon did the first market forecast on big data, everybody was like, oh, Hadoop, Hadoop, Hadoop, and we're like, well, Hadoop will catalyze a lot of things and companies will build a lot of things, but Hadoop itself will be a small piece of the market, and we've started to see some consolidation in that market. So data, and the value that I get out of the data, is the important thing. So what are your customers focused on? How do they get from their traditional data warehouses to something more modern? What are the challenges that they're dealing with, and where are you engaging with them? >> Right, sure. So one of the challenges they do have is this explosion of options. Am I doing things in Hadoop? What is Hadoop at this point? Which projects actually constitute Hadoop? So which repository am I going to use? Am I going to use Hive? Am I going to use MongoDB, Elastic? What's the repository I'm targeting? Generally what we see is that each of those, and a long list of additional repositories, has a role to play for a specific use case. And then, how am I going to get the data there and integrate it, and then get the data out and deliver insights? And that stack of technologies and tools is pretty intimidating. And so we see customers starting to coalesce around some market leaders in that space. The merger of Hortonworks and Cloudera I think was a very good thing for the industry. It just simplifies the life of the customer in terms of making decisions with confidence in that stack. It certainly simplifies our life as a partner of those firms, and I think it will help accelerate maturity in that tech stack. And so I think we're starting to see pockets of maturation, which I think will accelerate customers' investments in leveraging these next generation technologies. That then creates a big opportunity for us, because now it's becoming real.
Now I really have to get my data, on a real-time basis, out of my mainframe or my IBM i system into these next generation repositories, and it turns out that's technically a challenge, and so what we're seeing in our business is real acceleration of our big data solutions against what I would say are production-targeted workloads and projects, which is great. >> Alright, M&A, you guys are always really active in this space. We've covered Syncsort for many years, so we've watched some of the changes along the way. I believe you've got some news to share regarding M&A activity, and there's also some recent stuff to tap into from the last year. Maybe bring us up to speed. >> Sure, so we've made two announcements. We made an announcement in the last few weeks and then one very recently that I'd like to share. The first is, about two months ago we struck up a developmental relationship with IBM around their B2B collaboration portfolio, and this product set really gives us exposure to integration styles between businesses. Historically we've been focused on integration within a business, and so we really like the exposure to that. More importantly, it intersects with one of these next generational data themes around Blockchain, and we believe there's a huge opportunity to help be a leader in how you take Blockchain infrastructure and integrate it with these existing systems. So we're really excited to partner with IBM on that front. And IBM obviously is making huge investments there. >> Before we go on, what's Syncsort's play there when it comes to Blockchain? We have definitely talked to IBM quite a bit about Blockchain, Hyperledger, everything going into there. So maybe give a little more color there. >> Sure, so look, we still think that production workloads on Blockchain are a few years out, and we see a lot of pilot activity. So I think people are still trying to understand the specific use cases that are going to deliver real value. But one thing is for certain: as customers start to stand up production workloads on the Blockchain, they're going to need to integrate what's happening in that new infrastructure with these traditional systems that are still managing the large majority of their transactions. How do I add data to the Blockchain? How do I verify data on the Blockchain? How do I improve the quality of data on the Blockchain? How do I pull data off of the Blockchain? We think there's a really important role for us to play around understanding the specifics of those use cases, how they intersect with some of these legacy systems, and how we provide tailored solutions that are best in class. It's one of the primary reasons we've struck up the relationship with IBM, but also joined Hyperledger. So hopefully that gives you a little bit more context. >> That's great. >> The more recent announcement I want to make is that we've acquired a company called Eview, and Eview is a terrific leader in the machine data integration space. They have a number of solutions that are complementary to what we've done with our Ironstream product, and what we're trying to do there is support as many use cases as possible for people to maximize the value they can get out of machine data, particularly as it relates to older systems like mainframe and IBM i. And what this acquisition does is it allows us to take another step forward in terms of the value proposition that we offer our customers.
One specific use case where Eview's been a leader that we're very excited about is integration with ServiceNow. And you can think of ServiceNow as kind of a next generation platform that we, to date, have not had integration with. This acquisition gives us that integration. It also gives us a set of technology and talent that we can put towards accelerating our overall big data plans. And so we're really excited about having the Eview team join the Syncsort family, and about what we can deliver for customers. >> Yeah, great, great. Absolutely, companies like ServiceNow and Workday, huge amounts of data there, and we're seeing a lot of it. Dave Vellante's been at the ServiceNow Knowledge show with theCUBE for a number of years. Really interesting. Seems like this acquisition ties in well with, I believe it was Vision, a year ago? >> Well, so it ties in mostly with our Ironstream product. >> Okay. >> Now, Vision contributed to the Ironstream product in that it gave us the expertise to deliver integration for IBM i log data into next generation analytic platforms like Splunk and Elastic. We had built a product that was focused on delivering mainframe data in real-time to those platforms. Vision gave us both real-time capability and a huge franchise in the IBM i space. Eview builds on that and gives us additional capability in terms of delivering data to new repositories like ServiceNow. >> Great, maybe step back for a second. Give us kind of some of the speeds and feeds of Syncsort itself. Momentum of the company; you've been CEO for a while now. Tell us how you're doing. >> Yeah, we're doing well. We're having a record year. It's important to recognize that in September we celebrated our 50th anniversary. So I think we're a bit unusual in terms of our heritage. Having said that, we've never driven more innovation than we have over the last 12 months. We have tripled the size of the business over the last three years since I've been CEO. We've quadrupled the employee base. And we will continue to see, I think, rapid growth, given the opportunity set we see in this big iron to big data space. >> Yeah, Josh, you talk about that. When I look at, okay, a 50-year-old company: we talked about data quite a bit differently 50 years ago. What is the digital transformation today? What does that mean for Syncsort? What does that mean for your customers? Help put us in context. >> Yeah, I mean, it kind of goes back to this original positioning, which is: the largest banks in the world, the largest telecommunications vendors in the world, healthcare, government, you pick the industry, they built a set of systems that they still run today, over the last four or five decades. Those systems tend to produce the most important data of the enterprise; not the only data you want to analyze, but it tends to be the reference data that allows you to make sense of everything else. And as you think about how am I going to analyze that data, how am I going to maximize the value of that data, there is a need to integrate the data and move it off of those platforms and into these next generation platforms. And if you look at the way a VSAM file was designed for the computing requirements of 1970, it turns out it's really different than the way you would design a JSON file or a file for Impala. And so knitting that together takes a lot of deep expertise on both sides of the equation, and we uniquely have that expertise and are solving that.
And what we've seen is, as new technologies continue to come to market, which we refer to as the next wave, our enterprise customer base of 7,000 customers needs a partner that can say: how do I take advantage of that new technology trend in the context of the past 30, 40, 50 years of investment I've made in mission critical systems, and how do I support the key integration use cases? And that's where we've determined we can make a difference in the market: focusing on what those use cases are, and delivering differentiated solutions to solve them that help both our customers and these partners. >> Absolutely, it's always great to talk about some of the new stuff, but you need to meet the customers where they are, get to that data where it is, and help move it forward. Alright, Josh, why don't you give us the final word? Kind of broadly open: big challenges, opportunities, what's exciting you as you look forward to the next six months? >> Yeah, so we'll continue to make investments in cloud, in data governance, in supporting real-time data streaming, and in security. Those are the areas where we'll be focused on driving innovation and delivering additional capability to our customers. Some of that will come through taking technologies like Eview or the B2B products and enhancing them for specific use cases where they intersect those things. There will also be additional investments from an acquisition perspective in those domains, and you can count on Syncsort to continue to expand the value proposition it delivers to its customers, both through new technology introductions and through additional integration with these next generation platforms. So we're really excited. I mean, we believe our strategy is working. It's led to record results in our 50th year, and we think we've got many years to run with this strategy. >> Alright, well, Josh Rogers, CEO of Syncsort. Congratulations on the progress. New acquisition, deeper partnership with IBM, and I look forward to tracking the updates. >> Thanks so much. Appreciate the opportunity. >> Alright, and thank you as always for joining. I'm Stu Miniman. Thanks for watching theCUBE. (upbeat electronic music)
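The VSAM-to-JSON gap Josh describes is worth making concrete. Mainframe files are typically fixed-width, EBCDIC-encoded records, so even reading one row requires a field layout and a code-page translation. The toy sketch below uses an invented record layout; real integration products also handle COBOL copybooks, packed decimals, and change data capture, so read it purely as an illustration of the format gap.

```python
# Toy illustration of the 1970-vs-today format gap: decode one fixed-width,
# EBCDIC-encoded record into JSON. The field layout is made up for this
# example and is not any vendor's actual format.
import json

EBCDIC = "cp037"  # a common US EBCDIC code page

def decode_record(raw: bytes) -> dict:
    # Hypothetical layout: 10-byte account id, 20-byte name, 8-byte amount
    return {
        "account_id": raw[0:10].decode(EBCDIC).strip(),
        "name": raw[10:30].decode(EBCDIC).strip(),
        "amount_cents": int(raw[30:38].decode(EBCDIC)),
    }

# Simulate a record as it would arrive from the mainframe side.
record = ("0000123456" + "JANE DOE".ljust(20) + "00009950").encode(EBCDIC)
print(json.dumps(decode_record(record)))  # now ready for Impala, Elastic, etc.
```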

Published Date : Nov 27 2018

SUMMARY :

From the SiliconANGLE media office and welcome to our Boston area studio. Great to see you. Data is at the center of everything. and of course that situation's always changing and allow that to be leveraged much easier. and how are they going to intersect What are the challenges that they're dealing with So I mean one of the challenges they do have and there's also some recent stuff to tap in the last year. and integrate it to these existing systems. We have definitely talked to IBM quite a bit that are still managing the large majority that are complementary to what we've done Dave Alonte's been at the ServiceNow knowledge show and a huge franchise in the IBM i space. Memento the company, you've been CEO for a while now. and we see in this big iron to big data space. What is the digital transformation today? and how do I support the key integration use cases? some of the new stuff and we think we've got many years to run with this strategy. and I look forward to tracking the updates. Appreciate the opportunity. Alright, and thank you as always for joining.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
Josh | PERSON | 0.99+
IBM | ORGANIZATION | 0.99+
Dave Alonte | PERSON | 0.99+
Josh Rogers | PERSON | 0.99+
Syncsort | ORGANIZATION | 0.99+
Hortonworks | ORGANIZATION | 0.99+
Boston | LOCATION | 0.99+
Amazon | ORGANIZATION | 0.99+
Stu Miniman | PERSON | 0.99+
September | DATE | 0.99+
1970 | DATE | 0.99+
7,000 customers | QUANTITY | 0.99+
November 2018 | DATE | 0.99+
50th year | QUANTITY | 0.99+
Cloudera | ORGANIZATION | 0.99+
two areas | QUANTITY | 0.99+
last year | DATE | 0.99+
second | QUANTITY | 0.99+
Wikibon | ORGANIZATION | 0.99+
first | QUANTITY | 0.99+
Boston, Massachusetts | LOCATION | 0.99+
Eview | ORGANIZATION | 0.99+
Hadoop | TITLE | 0.99+
each | QUANTITY | 0.99+
a year ago | DATE | 0.99+
one | QUANTITY | 0.99+
Vision | ORGANIZATION | 0.98+
both | QUANTITY | 0.98+
both sides | QUANTITY | 0.98+
50th anniversary | QUANTITY | 0.98+
Hyperledger | ORGANIZATION | 0.98+
trillions of dollars | QUANTITY | 0.98+
five years ago | DATE | 0.98+
two announcements | QUANTITY | 0.98+
MongoDB | TITLE | 0.98+
Edge | ORGANIZATION | 0.97+
50 years ago | DATE | 0.97+
30 | QUANTITY | 0.96+
today | DATE | 0.96+
JSON | TITLE | 0.95+
ServiceNow | TITLE | 0.94+
50-year-old | QUANTITY | 0.93+
Elastic | TITLE | 0.92+
Evue | ORGANIZATION | 0.9+
40 years | QUANTITY | 0.9+
M&A | ORGANIZATION | 0.9+
AS400 | COMMERCIAL_ITEM | 0.89+
next six months | DATE | 0.88+
one thing | QUANTITY | 0.87+
Splunk | TITLE | 0.87+
40, 50 years | QUANTITY | 0.86+
last 12 months | DATE | 0.86+
ThinkSort | ORGANIZATION | 0.84+
Hive | TITLE | 0.84+

Steve Wooledge, Arcadia Data & Satya Ramachandran, Neustar | DataWorks Summit 2018


 

(upbeat electronic music) >> Live from San Jose, in the heart of Silicon Valley, it's theCUBE. Covering Dataworks Summit 2018, brought to you by Hortonworks. (electronic whooshing) >> Welcome back to theCUBE's live coverage of Dataworks, here in San Jose, California. I'm your host, Rebecca Knight, along with my co-host, James Kobielus. We have two guests in this segment: we have Steve Wooledge, he is the VP of Product Marketing at Arcadia Data, and Satya Ramachandran, who is the VP of Engineering at Neustar. Thanks so much for coming on theCUBE. >> Our pleasure, and thank you. >> Thank you. >> So let's start out by setting the scene for our viewers. Tell us a little bit about what Arcadia Data does. >> Arcadia Data is focused on getting business value from these modern scale-out architectures, like Hadoop and the cloud. We started in 2012 to solve the problem of how we get value into the hands of the business analysts that understand a little bit more about the business, in addition to empowering the data scientists to deploy their models and value to a much broader audience. So I think that's been, in some ways, the last mile of value that people need to get out of Hadoop and data lakes: getting it into the hands of the business. So that's what we're focused on. >> And start seeing the value, as you said. >> Yeah, seeing is believing, a picture is worth a thousand words, all those good things. And what's really emerging, I think, is companies realizing that traditional BI technology won't solve the scale and user concurrency issues, because architecturally, big data's different, right? We're on scale-out, MPP architectures now, like Hadoop; the data complexity and variety has changed, but the BI tools are still the same, and you pull the data out of the system to put it into some little micro cube to do some analysis. Companies want to go after all the data, and view the analysis across a much broader set, and that's really what we enable. >> I want to hear about the relationship between your two companies, but Satya, tell us a little about Neustar, what you do. >> Neustar is an information services company; we are built around identity. We are the premier identity provider, the most authoritative identity provider, for the US. And we've built a whole bunch of services around that identity platform. I am part of the marketing solutions group, and I head analytics engineering for marketing solutions. The product that I work on helps marketers do their annual planning, as well as their campaign or tactical planning, so that they can fine-tune their campaigns on an ongoing basis. >> So how do you use Arcadia Data's primary product? >> We are a predictive analytics platform, and we use Arcadia for the reporting part of it. We have multiple terabytes of advertising data in our platform, and we use Arcadia to provide fast access for our customers, and also very granular and explorative analysis of this data. >> So you say you help your customers with their marketing campaigns, so are you doing predictive analytics? And are you doing churn analysis and so forth? And how does Arcadia fit into all of that? >> So we get data, and then we build an activation model, which tells how the marketing spend corresponds to revenue. We not only do historical analysis, we also do predictive, in the sense that marketers frequently do what-if analysis, saying, what if I moved my budget from paid search to TV?
And how does it affect the revenue? So all of this modeling is built by Neustar; the modeling platform is built by Neustar, but the last mile of taking these reports and providing this explorative analysis of the results is provided by the reporting solution, which is Arcadia. >> Well, I mean, the thing about data analytics is that it really is going to revolutionize marketing. That famous marketing adage of, I know half my advertising works, I just don't know which half; now we're really going to be able to figure out which half. Can you talk a little bit about return on investment and what your clients see? >> Sure, we've got some major Fortune 500 companies that have said publicly that they've realized over a billion dollars of incremental value. And that could be across both marketing analytics, and how we better target our messaging, our brand, to reach our intended audience. There's things like supply chain, and being able to do more realtime what-if analysis for different routes; it's things like cyber security, and stopping fraud and waste and things like that at a much grander scale than what was really possible in the past. >> So we're here at Dataworks and it's the Hortonworks show. Give us a sense of the degree of your engagement or partnership with Hortonworks and participation in their partner ecosystem. >> Yeah, absolutely. Hortonworks is one of our key partners, and what we did that's different architecturally is we built our BI server directly into the data platforms. What I mean by that is, we take the concept of a BI server, and we install it and run it on the data nodes of the Hortonworks Data Platform. We inherit the security directly out of systems like Apache Ranger, so that all that administration and scale is done at Hadoop economics, if you will, and it leverages the things that are already in place. So that has huge advantages both in terms of scale but also simplicity, and then you get the performance and the concurrency that companies need to deploy out to, like, 5,000 users directly on that Hadoop cluster. So, Hortonworks is a fantastic partner for us, and a large number of our customers run on Hortonworks, as well as other platforms, such as Amazon Web Services, where Satya's got his system deployed. >> At the show they announced Hortonworks Data Platform 3.0. There's containerization there, there's updates to Hive to enable it to be more of a realtime analytics and also a data warehousing engine. At Arcadia Data, do you follow their product enhancements, in terms of your own product roadmap, with any specific, fixed cycle? Are you going to be leveraging the new features in HDP 3.0 going forward to add value to your customers' ability to do interactive analysis of this data in close to realtime? >> Sure, yeah, no, because we're a native-- >> 'Cause marketing campaigns are increasingly realtime, especially when you've got a completely digital business. >> Yeah, absolutely. So we benefit from the innovations happening within the Hortonworks Data Platform. Because we're a native BI tool that runs directly within that system, with changes in Hive or different things within HDFS, in terms of performance or compression and things like that, our customers generally benefit from that directly, so yeah. >> Satya, going forward, what are some of the problems that you want to solve for your clients? What are their biggest pain points, and where do you see Neustar?
So, marketers, also for them now, data is the biggest, is what they're going after. They want faster analysis, they want to be able to get to insights as fast as they can, and they want to obviously get, work on as large amount of data as possible. The variety of sources is becoming higher and higher and higher, in terms of marketing. There used to be a few channels in '70s and '80s, and '90s kind of increased, now you have like, hundreds of channels, if not thousands of channels. And they want visibility across all of that. It's the ability to work across this variety of data, increasing volume at a very high speed. Those are high level challenges that we have at Neustar. >> Great. >> So the difference, marketing attribution analysis you say is one of the core applications of your solution portfolio. How is that more challenging now than it had been in the past? We have far more marketing channels, digital and so forth, then how does the state-of-the-art of marketing attribution analysis, how is it changing to address this multiplicity of channels and media for advertising and for influencing the customer on social media and so forth? And then, you know, can you give us a sense for then, what are the necessary analytical tools needed for that? We often hear about a social graph analysis or semantic analysis, or for behavioral analytics and so forth, all of this makes it very challenging. How can you determine exactly what influences a customer now in this day and age, where, you think, you know, Twitter is an influencer over the conversation. How can you nail that down to specific, you know, KPIs or specific things to track? >> So I think, from our, like you pointed out, the variety is increasing, right? And I think the marketers now have a lot more options than what they have, and that that's a blessing, and it's also a curse. Because then I don't know where I'm going to move my marketing spending to. So, attribution right now, is still sitting at the headquarters, it's kind of sitting at a very high level and it is answering questions. Like we said, with the Fortune 100 companies, it's still answering questions to the CMOs, right? Where attribution will take us, next step is to then lower down, where it's able to answer the regional headquarters on what needs to happen, and more importantly, on every store, I'm able to then answer and tailor my attribution model to a particular store. Let's take Ford for an example, right? Now, instead of the CMO suite, but, if I'm able to go to every dealer, and I'm able to personal my attribution to that particular dealer, then it becomes a lot more useful. The challenge there is it all needs to be connected. Whatever model we are working for the dealer, needs to be connected up to the headquarters. >> Yes, and that personalization, it very much leverages the kind of things that Steve was talking about at Arcadia. Being able to analyze all the data to find those micro, micro, micro segments that can be influenced to varying degrees, so yeah. I like where you're going with this, 'cause it very much relates to the power of distributing federated big data fabrics like Hortonworks' offers. >> And so it's streaming analytics is coming to forward, and it's been talked about for the past longest period of time, but we have real use cases for streaming analytics right now. Similarly, the large volumes of the data volumes is, indeed, becoming a lot more. So both of them are doing a lot more right now. >> Yes. >> Great. 
>> Well, Satya and Steve, thank you so much for coming on theCUBE, this was really, really fun talking to you. >> Excellent. >> Thanks, it was great to meet you. Thanks for having us. >> I love marketing talk. >> (laughs) It's fun. I'm Rebecca Knight, for James Kobielus, stay tuned to theCUBE, we will have more coming up from our live coverage of Dataworks, just after this. (upbeat electronic music)
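To give a flavor of the what-if analysis Satya describes, here is a deliberately toy sketch: shift budget between two channels and compare predicted revenue under an assumed response curve. The curve shape, coefficients, and dollar figures are all invented for illustration; they stand in for the activation models Neustar actually fits from historical data.

```python
# Toy what-if: move budget between channels and compare predicted revenue.
# The saturating response curve and its parameters are invented for this
# illustration; they are not Neustar's actual models.
import math

# Hypothetical per-channel response: revenue = a * ln(1 + spend / b),
# a simple diminishing-returns shape common in toy media-mix examples.
PARAMS = {"paid_search": (5.0e5, 1.0e5), "tv": (9.0e5, 4.0e5)}

def predicted_revenue(spend_by_channel: dict) -> float:
    return sum(a * math.log1p(spend_by_channel.get(ch, 0.0) / b)
               for ch, (a, b) in PARAMS.items())

baseline = {"paid_search": 300_000.0, "tv": 500_000.0}
what_if = {"paid_search": 200_000.0, "tv": 600_000.0}  # shift $100k to TV

print(f"baseline predicted revenue: ${predicted_revenue(baseline):,.0f}")
print(f"what-if predicted revenue:  ${predicted_revenue(what_if):,.0f}")
```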

Published Date : Jun 20 2018

SUMMARY :

brought to you by Hortonworks. the VP of Product Marketing the scene for our viewers. the data scientists to deploy their models the value, as you said. and you pull the data out of the system Neustar, what you do. and I head the analytics engineering the reporting solution, we use Arcadia analysis of the results, and what your clients see? and being able to more realtime and it's the Hortonworks show. and it leverages the things of this data in close to realtime? you got a completely digital business. So we benefit from the It's the ability to work to specific, you know, KPIs and I'm able to personal my attribution the data to find those micro, analytics is coming to forward, talking to you. Thanks, it was great to meet you. stay tuned to theCUBE, we

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
James Kobielus | PERSON | 0.99+
Steve Wooledge | PERSON | 0.99+
Rebecca Knight | PERSON | 0.99+
Satya Ramachandran | PERSON | 0.99+
Steve | PERSON | 0.99+
Hortonworks | ORGANIZATION | 0.99+
Neustar | ORGANIZATION | 0.99+
Arcadia Data | ORGANIZATION | 0.99+
Ford | ORGANIZATION | 0.99+
Satya | PERSON | 0.99+
2012 | DATE | 0.99+
San Jose | LOCATION | 0.99+
two companies | QUANTITY | 0.99+
Silicon Valley | LOCATION | 0.99+
two guests | QUANTITY | 0.99+
Arcadia | ORGANIZATION | 0.99+
San Jose, California | LOCATION | 0.99+
Amazon Web Services | ORGANIZATION | 0.99+
US | LOCATION | 0.99+
both | QUANTITY | 0.99+
Hortonworks' | ORGANIZATION | 0.99+
5,000 users | QUANTITY | 0.99+
Dataworks | ORGANIZATION | 0.98+
theCUBE | ORGANIZATION | 0.98+
one | QUANTITY | 0.97+
Twitter | ORGANIZATION | 0.96+
hundreds of channels | QUANTITY | 0.96+
Dataworks Summit 2018 | EVENT | 0.96+
DataWorks Summit 2018 | EVENT | 0.93+
thousands of channels | QUANTITY | 0.93+
over a billion dollars | QUANTITY | 0.93+
Data Platform 3.0 | TITLE | 0.9+
'70s | DATE | 0.86+
Arcadia | TITLE | 0.84+
Hadoop | TITLE | 0.84+
HDP 3.0 | TITLE | 0.83+
'90s | DATE | 0.82+
Apache Ranger | ORGANIZATION | 0.82+
thousand words | QUANTITY | 0.76+
HDFS | TITLE | 0.76+
multi terabytes | QUANTITY | 0.75+
Hive | TITLE | 0.69+
Neustar | TITLE | 0.67+
Fortune | ORGANIZATION | 0.62+
80s | DATE | 0.55+
500 | QUANTITY | 0.45+
100 | QUANTITY | 0.4+
theCUBE | TITLE | 0.39+

Ram Venkatesh, Hortonworks & Sudhir Hasbe, Google | DataWorks Summit 2018


 

>> Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2018. Brought to you by Hortonworks. >> We are wrapping up Day One of coverage of Dataworks here in San Jose, California on theCUBE. I'm your host, Rebecca Knight, along with my co-host, James Kobielus. We have two guests for this last segment of the day. We have Sudhir Hasbe, who is the director of product management at Google, and Ram Venkatesh, who is VP of Engineering at Hortonworks. Ram, Sudhir, thanks so much for coming on the show. >> Thank you very much. >> Thank you. >> So, I want to start out by asking you about a joint announcement that was made earlier this morning about using some Hortonworks technology deployed onto Google Cloud. Tell our viewers more. >> Sure, so basically what we announced was support for the Hortonworks Data Platform and Hortonworks DataFlow, HDP and HDF, running on top of the Google Cloud Platform. This includes deep integration with Google's cloud storage connector layer, as well as a certified distribution of HDP to run on the Google Cloud Platform. >> I think the key thing is a lot of our customers have been telling us they like the familiar environment of the Hortonworks distribution that they've been using on-premises, and as they look at moving to a cloud like GCP, Google Cloud, they want that same familiar environment. So, they want the choice to deploy on-premises or on Google Cloud, but they want the familiarity of what they've already been using with Hortonworks products. So this announcement actually helps customers pick and choose: whether they want to run the Hortonworks distribution on-premises, do it in the cloud, or build a hybrid solution where the data can reside on-premises and move to the cloud, building this common, hybrid architecture. So, that's what this does. >> So, HDP customers can store data in the Google Cloud. They can execute ephemeral workloads, analytic workloads, machine learning in the Google Cloud. And there's some tie-in between Hortonworks's real-time or low latency or streaming capabilities from HDF in the Google Cloud. So, could you describe, at a full sort of detail level, the degrees of technical integration between your two offerings here? >> You want to take that? >> Sure, I'll handle that. So, essentially, deep in the heart of HDP there's the HDFS layer, which includes the Hadoop Compatible File System, a pluggable file system layer. So, what Google has done is they have provided an implementation of this API for the Google Cloud Storage Connector. So this is the GCS Connector. We've taken the connector and we've actually continued to refine it to work with our workloads, and now Hortonworks is actually bundling, packaging, and making this connector available as part of HDP. >> So bilateral data movement between them? Bilateral workload movement? >> No, think of this as being very efficient when our workloads are running on top of GCP. When they need to get at data, they can get at data that is in the Google Cloud Storage buckets in a very, very efficient manner. Since we have fairly deep expertise on workloads like Apache Hive and Apache Spark, we've actually done work in these workloads to make sure that they can run efficiently, not just on HDFS, but also on the cloud storage connector. This is a critical part of making sure that the architecture is actually optimized for the cloud.
So, at our scale, as our customers are moving their workloads from on-premise to the cloud, it's not just functional parity; they also need the operational and cost efficiency that they're looking for as they move to the cloud. To do that, we need to enable this fundamental disaggregated storage pattern. See, on-prem, the big win with Hadoop was that we could bring the processing to where the data was. In the cloud, we need to make sure that we work well when storage and compute are disaggregated and scaled elastically, independent of each other. So this is a fairly fundamental architectural change, and we want to make sure that we enable it in a first-class manner. >> I think that's a key point, right. I think what cloud allows you to do is scale storage and compute independently. And so, by storing data in Google Cloud Storage, you can scale that horizontally and then just leverage it as your storage layer, and the compute can independently scale by itself. What this allows customers of HDP and HDF to do is store the data on GCP, on cloud storage, and then independently scale the compute side of it with HDP and HDF. >> So, if you'll indulge me to name another Hortonworks partner, for just a hypothetical: let's say one of your customers is using IBM Data Science Experience to do TensorFlow modeling and training. Can they then, inside of HDP on GCP, use the compute infrastructure inside of GCP to do the actual modeling, which is more compute intensive, and then the separate, decoupled storage infrastructure to do the training, which is more storage intensive? Is that a capability that would be available to your customers, with this integration with Google? >> Yeah, so where we are going with this is, we are saying that IBM DSX and other solutions that are built on top of HDP can transparently take advantage of the fact that they have HDP compute infrastructure to run against. So, you can run your machine learning training jobs, you can run your scoring jobs, and you can have the same unmodified DSX experience whether you're running against an on-premise HDP environment or an in-cloud HDP environment. That's sort of the benefit for partners and partner solutions. From a customer standpoint, the big value prop here is that customers are used to securing and governing their data on-prem in their particular way with HDP, with Apache Ranger, Atlas, and so forth. So, when they move to the cloud, we want this experience to be seamless from a management standpoint. From a data management standpoint, we want all of their learning from a security and governance perspective to apply when they are running in Google Cloud as well. We've had this capability on Azure and on AWS, so with this partnership, we are announcing the same type of deep integration with GCP as well. >> So Hortonworks is that one pane of glass across all your product partners for all manner of jobs. Go ahead, Rebecca. >> Well, I just wanted to ask about, we've talked about the reasons, the impetus for this. With the customer, it's more familiar, it offers a seamless experience. But can you delve a little bit into the business problems that you're solving for customers here? >> A lot of times, our customers are at various points on their cloud journey. For some of them, it's very simple: there's a broom coming by, the datacenter is going away in 12 months, and I need to be in the cloud.
So, this is where there is a wholesale movement of infrastructure from on-premise to the cloud. Others are exploring individual business use cases. So, for example, one of our large customers, a travel partner, is exploring a new pricing model, and they want to roll out this pricing model in the cloud. They have on-premise infrastructure; they know they'll have that for a while. They are spinning up new use cases in the cloud, typically for reasons of agility. Typically, many of our customers operate large, multi-tenant clusters on-prem. That's nice for very scalable compute for running large jobs, but if you want to run, for example, a new version of Spark, you have to upgrade the entire cluster before you can do that. Whereas in this sort of model, they can bring up a new workload with just the specific versions and dependencies it needs, independent of all of their other infrastructure. So this gives them agility, where they can move as fast as... >> Through the containerization of the Spark jobs or whatever. >> Correct, and so containerization, as well as even spinning up an entire new environment. Because, in the cloud, given that you have access to elastic compute resources, environments can come and go. So, your workloads are much more independent of the underlying cluster than they are on-premise. And this is where the core business benefits around agility, speed of deployment, and things like that come into play. >> And also, if you look at the total cost of ownership, take an example where customers are collecting all this information through the month, and at month end you want to do the closing of the books. That's a great example where you want ephemeral workloads. You do it once a month: finish the books and close the books. That's a great scenario for cloud, where you don't have to create infrastructure on-premises and keep it ready. So that's one example. The second example I can give is, a lot of customers run their e-commerce platforms and all on-premises. They can still collect all these events through HDP that may be running on-premises with Kafka, and then, in-cloud, in GCP, you can deploy HDP and HDF, and use the HDF from there for real-time stream processing. So, collect all these clickstream events and use them to make decisions like, hey, which products are selling better? Should we go ahead and give? How many people are looking at that product? How many people have bought it? That kind of aggregation and real-time analysis at scale you can now do in-cloud, and build these hybrid architectures, and enable scenarios where, in the past, to do that kind of stuff you would have to procure hardware and deploy hardware, all of which goes away. In-cloud, you can do that much more flexibly and just use whatever capacity you have. >> Well, you know, ephemeral workloads are at the heart of what many enterprise data scientists do. Real-world experiments, ad-hoc experiments, with certain datasets.
You build a TensorFlow model, or maybe a model in Caffe or whatever, and you deploy it out to a cluster, and so the life of a data scientist is often nothing but a stream of new tasks that are all ephemeral in their own right, but are part of an ongoing experimentation program where they're building and testing assets that may or may not be deployed in production applications. So I can see a clear need for that capability of this announcement in lots of working data science shops in the business world. >> Absolutely. >> And I think, coming down to it, if you really look at the partnership, there are two or three key areas where it's going to have a huge advantage for our customers. One is analytics at scale at a lower cost: reducing total cost of ownership while running analytics at scale. That's one of the big things. Again, as I said, the hybrid scenarios: most enterprise customers have huge deployments of infrastructure on-premises, and that's not going to go away. Over a period of time, leveraging cloud is a priority for a lot of customers, but they will be in these hybrid scenarios, and what this partnership allows them to do is have scenarios that span across cloud and the on-premises infrastructure they are building, and get business value out of all of it. And then, finally, we at Google believe that the world will be more and more real-time over time. We already are seeing a lot of these real-time scenarios with IoT events coming in and people making real-time decisions, and this is only going to grow. And this partnership also provides the whole streaming analytics capability in-cloud, at scale, for customers to build these hybrid plus real-time streaming scenarios. >> Well, it's clear what the Hortonworks partnership gives Google in this competitive space, the multi-cloud space. It gives you that ability to support hybrid cloud scenarios. You're one of the premier public cloud providers, as we all know. And clearly, now that you've got the Hortonworks partnership, you have the ability to support those kinds of highly hybridized deployments for your customers, many of whom I'm sure have those requirements. >> That's perfect, exactly right. >> Well, a great note to end on. Thank you so much for coming on theCUBE. Sudhir, Ram, thank you so much. >> Thank you, thanks a lot. >> Thank you. >> I'm Rebecca Knight for James Kobielus; we will have more tomorrow from DataWorks. We will see you tomorrow. This is theCUBE signing off. >> From sunny San Jose. >> That's right.
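For a concrete flavor of the disaggregated storage-and-compute pattern Ram describes in this interview, here is a rough sketch of a PySpark job reading a gs:// path through the Hadoop-compatible filesystem layer. The filesystem class name follows Google's published Cloud Storage connector, but property names and packaging vary by connector version, and the bucket, path, and keyfile below are invented, so treat this as a sketch rather than a definitive setup.

```python
# Sketch of the disaggregated pattern: an ephemeral Spark job whose compute
# scales independently of the data, which stays in Cloud Storage. Property
# names follow Google's Cloud Storage connector docs, but verify them for
# your versions; the bucket, path, and keyfile are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gcs-hybrid-demo")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                 "/etc/security/keys/gcs-sa.json")
         .getOrCreate())

# Reads go through the same Hadoop-compatible filesystem API that an
# hdfs:// path would use on-prem; only the URI scheme changes.
sales = spark.read.orc("gs://example-analytics-bucket/warehouse/sales/")
sales.groupBy("region").sum("amount").show()
```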

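And for the clickstream scenario Sudhir outlines, here is a minimal consumer that tallies product views from a Kafka topic. The topic, brokers, and message shape are invented, and in an HDF deployment this role would typically be played by NiFi and the Kafka-based streaming tooling rather than hand-written code; it is an illustration only.

```python
# Minimal sketch of the clickstream idea: consume product-view events from
# Kafka and keep a running count per product. Topic, brokers, and message
# fields are invented for illustration.
import json
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "clickstream-events",                 # hypothetical topic
    bootstrap_servers=["broker-1:9092"],  # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

views = Counter()
for message in consumer:
    event = message.value
    if event.get("type") == "product_view":
        views[event["product_id"]] += 1
        # e.g. feed a dashboard with the current best sellers
        print(views.most_common(5))
```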
Published Date : Jun 20 2018

SUMMARY :

in the heart of Silicon Valley, for coming on the show. So, I want to start out by asking you to run on the Google Cloud Platform. and as they look at moving to cloud, in the Google Cloud. So, essentially, deep in the heart of HDP, and the cost efficiency is scale the storage and to do the training which and you can have the same that one pane of glass With the customer, it's and just have the specific of the Spark jobs or whatever. of the underlying cluster and then, what you can and so the life of a data that the world will be And clearly now that you got, Sudhir, Ram, that you so much. We will see you tomorrow.

SENTIMENT ANALYSIS :

ENTITIES

Entity | Category | Confidence
James Kobielus | PERSON | 0.99+
Rebecca Knight | PERSON | 0.99+
Rebecca | PERSON | 0.99+
two | QUANTITY | 0.99+
Sudhir | PERSON | 0.99+
Ram Venkatesh | PERSON | 0.99+
San Jose | LOCATION | 0.99+
HortonWorks | ORGANIZATION | 0.99+
Sudhir Hasbe | PERSON | 0.99+
Google | ORGANIZATION | 0.99+
Hortonworks | ORGANIZATION | 0.99+
Silicon Valley | LOCATION | 0.99+
two guests | QUANTITY | 0.99+
San Jose, California | LOCATION | 0.99+
DataWorks | ORGANIZATION | 0.99+
tomorrow | DATE | 0.99+
Ram | PERSON | 0.99+
AWS | ORGANIZATION | 0.99+
one example | QUANTITY | 0.99+
one | QUANTITY | 0.99+
two offerings | QUANTITY | 0.98+
12 months | QUANTITY | 0.98+
One | QUANTITY | 0.98+
Day One | QUANTITY | 0.98+
DataWorks Summit 2018 | EVENT | 0.97+
IBM | ORGANIZATION | 0.97+
second example | QUANTITY | 0.97+
Google Cloud Platform | TITLE | 0.96+
Atlas | ORGANIZATION | 0.96+
Google Cloud | TITLE | 0.94+
Apache Ranger | ORGANIZATION | 0.92+
three key areas | QUANTITY | 0.92+
Hadoop | TITLE | 0.91+
Kafka | TITLE | 0.9+
theCUBE | ORGANIZATION | 0.88+
earlier this morning | DATE | 0.87+
Apache Hive | ORGANIZATION | 0.86+
GCP | TITLE | 0.86+
one pane | QUANTITY | 0.86+
IBM Data Science | ORGANIZATION | 0.84+
Azure | TITLE | 0.82+
Spark | TITLE | 0.81+
first | QUANTITY | 0.79+
HDF | ORGANIZATION | 0.74+
once in a month | QUANTITY | 0.73+
HDP | ORGANIZATION | 0.7+
TensorFlow | OTHER | 0.69+
Hortonworks DataPlatform | ORGANIZATION | 0.67+
Apache Spark | ORGANIZATION | 0.61+
GCS | OTHER | 0.57+
HDP | TITLE | 0.5+
DSX | TITLE | 0.49+
Cloud Storage | TITLE | 0.47+

John Kreisa, Hortonworks | DataWorks Summit 2018


 

>> Live from San José, in the heart of Silicon Valley, it's theCUBE! Covering DataWorks Summit 2018. Brought to you by Hortonworks. (electro music) >> Welcome back to theCUBE's live coverage of DataWorks here in sunny San José, California. I'm your host, Rebecca Knight, along with my co-host, James Kobielus. We're joined by John Kreisa. He is the VP of marketing here at Hortonworks. Thanks so much for coming on the show. >> Thank you for having me. >> We've enjoyed watching you on the main stage, it's been a lot of fun. >> Thank you, it's been great. It's been great general sessions, some great talks. Talking about the technology, we've heard from some customers, some third parties, and most recently from Kevin Slavin from The Shed which is really amazing. >> So I really want to get into this event. You have 2,100 attendees from 23 different countries, 32 different industries. >> Yep. This started as a small, >> That's right. tiny little thing! >> Didn't Yahoo start it in 2008? >> It did, yeah. >> You changed names a few years ago, but it's still the same event, looming larger and larger. >> Yeah! >> It's been great, it's gone international as you've said. It's actually the 17th total event that we've done. >> Yeah. >> If you count the ones we've done in Europe and Asia. It's a global community around data, so it's no surprise. The growth has been phenomenal, the energy is great, the innovations that the community is talking about, the ecosystem is talking about, are really great. It just continues to evolve as an event, it continues to bring new ideas and share those ideas. >> What are you hearing from customers? What are they buzzing about? Every morning on the main stage, you do different polls that say, "how much are you using machine learning? What portion of your data are you moving to the cloud?" What are you learning? >> So it's interesting because we've done similar polls in our show in Berlin, and the results are very similar. We did the cloud poll and there's a lot of buzz around cloud. What we're hearing is there's a lot of companies that are thinking about, or are somewhere along, their cloud journey. The question is exactly what their overall plans are, and there's a lot of news about maybe cloud will eat everything, but if you look at the poll results, something like 75% of the attendees said they have cloud in their plans. Only about 12% said they're going to move everything to the cloud, so a lot of hybrid with cloud. It's how to figure out which workloads to run where, how to think about that strategy in terms of where to deploy the data, where to deploy the workloads and what that should look like, and that's one of the main things that we're hearing and talking a lot about. >> We've been seeing that at Wikibon, and our recent update to the market forecast showed that public cloud will dominate increasingly in the coming decade, but hybrid cloud will be a long transition period for many or most enterprises who are still firmly rooted in on-premises deployment, so forth and so on. Clearly, the bulk of your customers, both of your customer deployments, are on premises. >> They are. >> So you're working from a good starting point which means you've got what, 1,400 customers? >> That's right, thereabouts.
Predominantly on premises, but many of them here at this show want to sustain their investment in a vendor that provides them with that flexibility, so that as they decide they want to use Google or Microsoft or AWS or IBM for a particular workload, their existing investment in Hortonworks doesn't prevent them from facilitating it, from moving that data and those workloads. >> That's right. The fact is that we want to help them do that; a lot of our customers have, I'll call it, a multi-cloud strategy. They want to be able to work with an Amazon or a Google or any of the other vendors in the space equally well and have the ability to move workloads around, and that's one of the things that we can help them with. >> One of the things you also did yesterday on the main stage was talk about this conference in the greater context of the world and what's going on right now. This is happening against the backdrop of the World Cup, and you said that this is really emblematic of data because this is a game, a tournament, that generates tons of data. >> A tremendous amount of data. >> It's showing how data can launch new business models, disrupt old ones. Where do you think we're at right now? For someone who's been in this industry for a long time, just lay the scene. >> I think we're still very much at the beginning. Even though the conference has been around for a while, the technology is emerging so fast and evolving so fast that we're still at the beginning of all the transformations. I've been listening to the customer presentations here and all of them are at some point along the journey. Many are really still starting. Even some of the polls that we had today spoke to the fact that they're very much at the beginning of their journey with things like streaming or some of the A.I. machine learning technologies. They're at various stages, so I believe we're really at the beginning of the transformation that we'll see. >> That reminds me of another detail of your product portfolio, or your architecture: streaming and edge deployments are also in the future for many of your customers who still primarily do analytics on data at rest. You've made an investment in a number of technologies, NiFi for streaming. There's something called MiNiFi that has been discussed here at this show as an enabler for streaming all the way out to edge devices. What I'm getting at is that's indicative of the investments Arun Murthy, one of your co-founders, has made; it was a very good discussion for us analysts and also here at the show. That is one of many investments you're making to prepare for a future where those workloads will be more predominant in the coming decade. One of the new things I've heard this week that I'd not heard in terms of emphasis from you guys is more of an emphasis on data warehousing as an important use case for HDP in your portfolios, specifically with Hive, now at version 3.0 in HDP 3.0. >> Yes. >> With the enhancements to Hive to support more real-time and low-latency queries, but also there are ACID capabilities there. I'm hearing something... what you guys are doing is consistent with one of your competitors, Cloudera. They're going deeper into data warehousing too because they recognize they've got to go there, like you do, to be able to absorb more of their customers' workloads. I think that's important that you guys are making that investment. You're not just big data, you're all data and all data applications, potentially, if your customers want to go there and engage you. >> Yes.
>> I think that was a significant, subtle emphasis that I, as an analyst, noticed. >> Thank you. There were so many enhancements in 3.0 that were brought from the community that it was hard to talk about everything in depth, but you're right. The enhancements to Hive in terms of performance have really enabled it to take on a greater set of workloads and the interactivity that we know our customers want. The advantage being that you have a common data layer in the back end and you can run all this different work. It might be data warehousing, high-speed query workloads, but you can do it on that same data with Spark and data-science-related workloads. Again, it's that common pooled backend of the data lake, and having that ability to do it with common security and governance is one of the benefits our customers are telling us they really appreciate. >> One of the things we've also heard this morning was talk about data analytics in terms of brand value and, importantly, brand protection. FedEx, exactly. As the speaker said, we've all seen these apology commercials. What do you think, is it damage control? What is the customer motivation here? >> Well a company can have billions of dollars of market cap wiped out by breaches in security, and we've seen it. This is not theoretical, these are actual occurrences that we've seen. Really, they're trying to protect the brand and the business and continue to be viable. They can get knocked back so far that it can take years to recover from the impact. They're looking at the security aspects of it, the governance of their data, the regulations of GDPR. These things you've mentioned have real financial impact on the businesses, and I think it's the brand and the actual operations and finances of the businesses that can be impacted negatively. >> When you're thinking about Hortonworks' marketing messages going forward, how do you want to be described now, and then how do you want customers to think of you five or 10 years from now? >> I want them to think of us as a partner to help them with their data journey, on all aspects of their data journey, whether they're collecting data from the edge, you mentioned NiFi and things like that, bringing that data back, processing it in motion, as well as processing it at rest, regardless of where that data lands. On premises, in the cloud, somewhere in between, the hybrid, multi-cloud strategy. We really want to be thought of as their partner in their data journey. That's really what we're doing. >> Even going forward, one of the things you were talking about earlier is the company sort of saying, "we want to be boring. We want to help you do all the stuff-" >> There's a lot of money in boring. >> There's a lot of money, right! Exactly! As you said, a partner in their data journey. Is it "we'll do anything and everything"? Are you going to do niche stuff? >> That's a good question. Not everything. We are focused on the data layer. The movement of data, the processing and storage, and truly the analytic applications that can be built on top of the platform. Right now we've stuck to our strategy. It's been very consistent since the beginning of the company in terms of taking these open source technologies, making them enterprise viable, developing an eco-system around it and fostering a community around it. That's been our strategy since before the company even started. We want to continue to do that and we will continue to do that.
There's so much innovation happening in the community that we quickly bring that into the products and make sure it's available in a trusted, enterprise-tested platform. That's really one of the things we see with our customers: over and over again they select us because we bring innovation to them quickly, in a safe and consumable way. >> Before we came on camera, I was telling Rebecca that Hortonworks has done a sensational job of continuing to align your product roadmaps with those of your leading partners. IBM, AWS, Microsoft. In many ways, your primary partners are not them, but the entire open source community: 26 open source projects that Hortonworks represents and incorporates in your product portfolio, projects in which you are a primary player and committer. You're a primary ingester of innovation from all the communities in which you operate. >> We do. >> That is your core business model. >> That's right. We both foster the innovation and we help drive the innovation ourselves with our engineers and architects. You're absolutely right, Jim. It's the ability to get that innovation, which is happening so fast in the community, into the product, and companies need to innovate. Things are happening so fast. Moore's Law was mentioned multiple times on the main stage, you know, and how it's impacting different parts of the organization. It's not just the technology, but business models are evolving quickly. We heard a little bit about Trimble, and if you've seen Tim Leonard's talk that he gave around what they're doing in terms of logistics and the ability to go all the way out to the farmer and impact what's happening at the farm, tracking things down to the level of a tomato or an egg all the way back and just understanding that. It's evolving business models. It's not just the tech but the evolution of business models. Rob talked about it yesterday. I think those are some of the things that are kind of key. >> Let me stay on that point really quick. The industrial internet, like precision agriculture and everything it relates to, is increasingly relying on visual analysis of parts, and eggs, and whatever it might be. That is convolutional neural networks, that is A.I., and it has to be trained, increasingly in the cloud where the data lives. The data lives in HDP clusters and whatnot. In many ways, no matter where the world goes in terms of industrial IoT, there will be massive clusters of HDFS and object storage driving it, and also embedded A.I. models that have to follow a specific DevOps life cycle. You guys have a strong orientation in your portfolio towards that degree of real-time streaming, as it were, of tasks that go through the entire life cycle: from preparing the data, to modeling, to training, to deploying it out to Google or IBM or wherever else they want to go. So I'm thinking that you guys are in a good position for that as well. >> Yeah. >> I just wanted to ask you finally, what is the takeaway? We're talking about the attendees, talking about the community that you're cultivating here, themes, ideas, innovation, insight. What do you hope an attendee leaves with? >> I hope that the attendee leaves educated, understanding the technology and the impacts that it can have, so that they will go back and change their business and continue to drive their data projects. The whole intent is really that, and we even changed the format of the conference for more educational opportunities.
For me, I want attendees to- a satisfied attendee would be one that learned about the things they came to learn so that they could go back to achieve the goals that they have when they get back. Whether it's business transformation, technology transformation, some combination of the two. To me, that's what I hope that everyone is taking away and that they want to come back next year when we're in Washington, D.C. and- >> My stomping ground. >> His hometown. >> Easy trip for you. They'll probably send you out here- (laughs) >> Yeah, that's right. >> Well John, it's always fun talking to you. Thank you so much. >> Thank you very much. >> We will have more from theCUBE's live coverage of DataWorks right after this. I'm Rebecca Knight for James Kobielus. (upbeat electro music)
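The MiNiFi edge-streaming pattern Jim raises earlier in this segment comes down to store-and-forward: buffer readings locally on the device, flush in batches when the uplink allows, and keep the data when it doesn't. Below is a toy sketch of that pattern in plain Python; the sensor, transport, and failure rate are hypothetical stand-ins, not MiNiFi's actual implementation.

```python
# Store-and-forward loop of the kind an edge agent like MiNiFi runs:
# buffer locally, forward in batches, put data back on transport failure.
# read_sensor and send_upstream are hypothetical stand-ins.
import random
import time
from collections import deque

buffer = deque(maxlen=10_000)  # bounded local buffer; oldest dropped first

def read_sensor() -> dict:
    # Stand-in for a real device reading.
    return {"ts": time.time(), "temp_c": 20 + random.random() * 5}

def send_upstream(batch: list) -> bool:
    # Stand-in for a site-to-site / HTTP / MQTT uplink that is flaky
    # and fails roughly 30% of the time.
    return random.random() > 0.3

def run(cycles: int = 1000) -> None:
    for _ in range(cycles):
        buffer.append(read_sensor())
        if len(buffer) >= 50:                    # flush in batches
            batch = [buffer.popleft() for _ in range(50)]
            if not send_upstream(batch):
                # Re-insert at the front, preserving order, so nothing
                # is lost while the link is down.
                buffer.extendleft(reversed(batch))
        time.sleep(0.01)

if __name__ == "__main__":
    run()
```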

Published Date : Jun 20 2018


Day Two Kickoff | DataWorks Summit 2018


 

>> Live from San Jose, in the heart of Silicon Valley, it's theCube. Covering DataWorks Summit 2018. Brought to you by Hortonworks. >> Welcome back to day two of theCube's live coverage of DataWorks here in San Jose, California. I'm your host, Rebecca Knight, along with my co-host James Kobielus. James, it's great to be here with you in the hosting seat again. >> Day two, yes. >> Exactly. So here we are, this conference, 2,100 attendees from 32 countries, 23 industries. It's a relatively big show. They do three of them during the year. One of the things that I really-- >> It's a well-established show too. I think this is like the 11th year since Yahoo started up the first Hadoop summit in 2008. >> Right, right. >> So it's an established event, yeah go. >> Exactly, exactly. But I really want to talk about Hortonworks the company. This is something that you had brought up in an analyst report before the show started and that was talking about Hortonworks' cash flow positivity for the first time. >> Which is good. >> Which is good, which is a positive sign, and yet what are the prospects for this company's financial health? We're still not seeing really clear signs of robust financial growth. >> I think the signs are good for the simple reason they're making significant investments now to prepare for the future that's almost inevitable. And the future that's almost inevitable, and when I say the future, I mean the 2020s, the decade that's coming. Most of their customers will shift more of their workloads, maybe not entirely yet, to public cloud environments for everything they're doing, AI, machine learning, deep learning. And clearly the beneficiaries of that trend will be the public cloud providers, all of whom are Hortonworks' partners and established partners, AWS, Microsoft with Azure, Google with, you know, Google Cloud Platform, IBM with IBM Cloud. Hortonworks, and this is... You know, their partnerships with these cloud providers go back several years so it's not a new initiative for them. They've seen the writing on the wall practically from the start of Hortonworks' founding in 2011, and they now need to go deeper towards making their solution portfolio capable of being deployable on-prem, in cloud, public clouds, and in various and sundry funky combinations called hybrid multi-clouds. Okay, so, they've been making those investments in those partnerships and in public-cloud-enabling the Hortonworks Data Platform. Here at this show, DataWorks 2018 here in San Jose, they've released the latest major version, HDP 3.0, of their core platform with a lot of significant enhancements related to things that their customers are increasingly doing-- >> Well I want to ask you about those enhancements. >> But also they have partnership announcements, the deep ones of integration and, you know, lift and shift of the Hortonworks portfolio of HDP with Hortonworks DataFlow and DataPlane Services, so that those solutions can operate transparently on those public cloud environments as and when the customers choose to shift their workloads. 'Cause Hortonworks really... You know, like Scott Gnau yesterday, I mean, just laid it on the line: they know that more of the public cloud workloads will predominate now in this space. They're just making these speculative investments that they absolutely have to now to prepare the way.
So I think this cost that they're incurring now to prepare their entire portfolio for that inevitable future is the right thing to do, and that's probably why they still have not attained massive, rock-and-rollin' positive cash flow yet, but I think they're preparing the way to do so in the coming decade. >> So their financial future is looking brighter and they're doing the right things. >> Yeah, yes. >> So now let's talk tech. And this is really where you want to be, Jim, I know you. >> Oh, I get sleep now and I don't think about tech constantly. >> So as you've said, they're really putting a lot of emphasis now on their public cloud partnerships. >> Yes. >> But they've also launched several new products and upgrades to existing products. What are you seeing that excites you and that you think really will be potential game changers? >> You know, this is geeky but this is important, 'cause it's at the very heart of Hortonworks Data Platform 3.0: containerization. When you're a data scientist and you're building a machine learning model using data that's maintained, persisted, and processed within Hortonworks Data Platform or any other big data platform, you increasingly want the ability, for developing machine learning, deep learning, AI in general, to take an application you might build using TensorFlow models on HDP, containerize it in Docker and, you know, orchestrate it all through Kubernetes and all that wonderful stuff, and deploy it out to increasingly edge computing, mobile computing, embedded computing environments where, you know, the real venture capital mania's happening, things like autonomous vehicles and, you know, drones, and you name it. So the fact is that Hortonworks has made that in many ways the premier new feature of HDP 3.0 announced here this week at the show. That very much harmonizes with where their partners are going with containerization of AI. IBM, one of their premier partners, very recently, like last month I think it was, announced the latest version of, what do they call it, IBM Cloud Private, which has containerization embedded as a core feature within that environment, which is a prem-based environment for AI and so forth. The fact that Hortonworks continues to maintain close alignment with the capabilities that its public cloud partners are building into their respective portfolios is important. But there's also Hortonworks with its, they call it, you know, single pane of glass, the DataPlane Services for metadata and monitoring and governance and compliance across these sprawling hybrid multi-cloud scenarios. They're continuing to make, in fact really focusing on, deep investments in that portfolio, so that when an IBM, or AWS, whoever, introduces some new feature in their respective platforms, Hortonworks has the ability to, as it were, abstract above and beyond all of that, so that the customer, the developer, and the data administrator, all they need to do, if they're a Hortonworks customer, is stay within the DataPlane Services environment to be able to deploy with harmonized metadata and harmonized policies and harmonized schemas and so forth and so on, and query optimization across these sprawling environments.
So Hortonworks, I think, knows where their bread is buttered and it needs to stay on the DPS, DataPlane Services, side, which is why a couple months ago in Berlin Hortonworks made, I think, the most significant announcement of the year for them, and really for the industry: they announced the Data Steward Studio in Berlin. It really clearly addressed the GDPR mandate that was coming up, but it also treats data stewardship as an end-to-end workflow for lots of, you know, core enterprise applications, absolutely essential. Data Steward Studio is a DataPlane Service that can operate across multi-cloud environments. Hortonworks is going to keep on, you know... They didn't have a DPS, DataPlane Services, announcement here in San Jose this week, but you can best believe that next year at this time at this show, and in the interim, they'll probably have a number of significant announcements to deepen that portfolio. Once again, it's to grease the wheels towards a more purely public cloud future in which there will be Hortonworks DNA inside most of their customers' environments going forward. >> I want to ask you about the themes of this year's conference. The thing is that you were in Berlin at the last big Hortonworks DataWorks Summit. >> (speaks in foreign language) >> And really GDPR dominated the conversations because the new rules and regulations hadn't yet taken effect and companies were sort of bracing for what life was going to be like under GDPR. Now the rules are here, they're here to stay, and companies are really grappling with it, trying to understand the changes and how they can exist in this new regime. What would you say are the biggest themes... We're still talking about GDPR, of course, but what would you say are the bigger themes of this week's conference? Is it scalability, is it... I mean, what do you think has dominated the conversations here? >> Well, scalability is not the big theme this week, though there are significant scalability announcements this week in the context of HDP 3.0: the ability to persist, in a scale-out fashion across multi-cloud, billions of files. Storage efficiency is an important piece of the overall announcement, with support for erasure coding, blah blah blah. That's not, you know... Already, Hortonworks, like all of the cloud providers and other big data providers, provides very scalable environments for storage, workload management. That was not the hugest, buzziest theme in terms of the announcements this week. The buzz of course was HDP 3.0. Containerization, that's important, but you know, we just came out of the day two keynote. AI is not a huge focus yet for a lot of the Hortonworks customers who are here, the developers. You know, most of their customers are not yet that far along in their deep learning journeys and whatnot, but they're definitely going there. There were plenty of really cool keynote discussions, including the one with the autonomous vehicles, the thing we just came out of. But that was not the predominant theme this week here. I think what it comes down to is that with HDP 3.0... Hive, though you tend to take it for granted, it's been in Hadoop from the very start, practically, and Hive is now a full enterprise database, and that's the core, one of the cores, of HDP 3.0.
Hive itself, now at version 3.0, is ACID compliant, and that may be totally geeky to most of the world, but it enables Hive to support transactional applications. So more big data in every environment is supporting more traditional enterprise applications, transactional applications that require things like two-phase commit and all that goodness. The fact is, you know, Hortonworks, from what I can see, is the first of the big data vendors to incorporate those enhancements to Hive 3.0, because they're so completely tuned in to the Hive environment as a committer. I think in many ways that is the predominant theme in terms of the new stuff that will actually resonate with the developers, their customers, here at the show. And, you know, enterprises in general can put more of their traditional enterprise application workloads on big data environments and specifically, Hortonworks hopes, its HDP 3.0. >> Well I'm excited to learn more here on theCube with you today. We've got a lot of great interviews lined up and a lot of interesting content. We got a great crew too, so this is a fun show to do. >> Sure is. >> We will have more from day two of the.
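To make the ACID point in this segment concrete: with Hive 3.0, an upsert that previously meant rewriting whole partitions becomes a single MERGE statement against a transactional table. Below is a hedged sketch using the PyHive client; the host, database, table names, and pre-existing staging table are invented for illustration, and the MERGE shown is standard Hive 3 syntax rather than anything quoted in the conversation.

```python
# Upsert change records into a transactional Hive 3 table with MERGE.
# Host, database, and table names are illustrative assumptions, and the
# customers_staging table is assumed to already hold the new rows.
from pyhive import hive

conn = hive.Connection(host="hdp-master.example.com", port=10000,
                       username="etl", database="sales")
cur = conn.cursor()

# ACID operations require a transactional table, stored as ORC.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id BIGINT, name STRING, tier STRING
    ) STORED AS ORC TBLPROPERTIES ('transactional'='true')
""")

# Apply staged changes in one statement: update matches, insert the rest.
cur.execute("""
    MERGE INTO customers AS t
    USING customers_staging AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET name = s.name, tier = s.tier
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.tier)
""")
conn.close()
```

This is the mechanism that lets a continuously updated big data store behave like the "full enterprise database" described above, rather than an append-only archive.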

Published Date : Jun 20 2018


Pandit Prasad, IBM | DataWorks Summit 2018


 

>> From San Jose, in the heart of Silicon Valley, it's theCube. Covering DataWorks Summit 2018. Brought to you by Hortonworks. (upbeat music) >> Welcome back to theCUBE's live coverage of DataWorks here in sunny San Jose, California. I'm your host Rebecca Knight along with my co-host James Kobielus. We're joined by Pandit Prasad. He handles analytics projects, strategy, and management at IBM Analytics. Thanks so much for coming on the show. >> Thanks Rebecca, glad to be here. >> So, why don't you just start out by telling our viewers a little bit about what you do in terms of the Hortonworks relationship and the other parts of your job. >> Sure, as you said I am in Offering Management, which is also known as Product Management, for IBM; I manage the big data portfolio from an IBM perspective. I was also working with Hortonworks on developing this relationship, nurturing that relationship, so it's been a year since the Hortonworks partnership. We announced this partnership exactly last year at the same conference. And now it's been a year, so this year has been a journey of aligning the two portfolios together. Right, so Hortonworks had HDP and HDF. IBM also had similar products, so we have, for example, Big SQL; Hortonworks has Hive; so how do Hive and Big SQL align together? IBM has Data Science Experience; where does that come into the picture on top of HDP? So before this partnership, if you looked at the market, it had been: you sell Hadoop, you sell a SQL engine, you sell data science. What this year has given us is more of a solution sell. Now with this partnership we go to the customers and say here is an end-to-end experience for you. You start with Hadoop, you put more analytics on top of it, you then bring Big SQL for complex queries and federation and visualization stories, and then finally you put data science on top of it, so it gives you a complete end-to-end solution, the end-to-end experience for getting the value out of the data. >> Now IBM a few years back released the Watson Data Platform for team data science, with DSX, Data Science Experience, as one of the tools for data scientists. Is Watson Data Platform still the core, I call it dev ops for data science and maybe that's the wrong term, that IBM provides to market, or is there sort of a broader dev ops framework within which IBM takes these tools to market? >> Sure, Watson Data Platform one year ago was more of a cloud platform and it had many components to it, and now we are bringing a lot of those components on to the (mumbles), and Data Science Experience is one part of it, so Data Science Experience... >> So Watson Analytics as well, for subject matter experts and so forth. >> Yes. And again Watson has a whole suite of business-focused offerings; Data Science Experience is more of a particular aspect of the focus, specifically on the data science, and that's now available on-prem, and now we are building out this on-prem stack, so we have HDP, HDF, Big SQL, Data Science Experience, and we are working towards adding more and more to that portfolio. >> Well you have a broader reference architecture and a stack of solutions, AI on Power and so forth, for more of the deep learning development. In your relationship with Hortonworks, are they reselling more of those tools into their customer base to supplement or extend what they already resell, DSX, or is that outside of the scope of the relationship?
>> No, it is all part of the relationship; these three have been the core of what we announced last year, and then there are other solutions. We have the whole governance solution, right, so again it goes back to the partnership: HDP brings with it Atlas. IBM has a whole governance portfolio, including the governance catalog. How do you expand the story from being a Hadoop-centric story to an enterprise data lake story? And now we are taking that to the cloud; that's what Truata is all about. Rob Thomas came out with a blog yesterday morning talking about Truata. If you look at it, it is nothing but a governed data lake hosted offering, if you want to simplify it. That's one way to look at it, and it caters to the GDPR requirements as well. >> For GDPR, for the IBM Hortonworks partnership, is the lead solution for GDPR compliance Hortonworks Data Steward Studio, or is it any number of solutions that IBM already has for data governance and curation, or is it a combination of all of that in terms of what you, as partners, propose to customers for soup-to-nuts GDPR compliance? Give me a sense for... >> It is a combination of all of those, so it has HDP, it has HDF, it has Big SQL, it has Data Science Experience, it has the IBM governance catalog, it has IBM data quality, and it has a bunch of security products, like Guardium, and it has some new IBM proprietary components that are very specific towards data (cough drowns out speaker) and how you deal with personal data and sensitive personal data as classified by GDPR. I'm supposed to query some high-level information but I'm not allowed to query deep into the personal information, so how do you block those queries, how do you understand those? These are not necessarily part of Data Steward Studio. These are some of the proprietary components that are thrown into the mix by IBM. >> One of the requirements that is not often talked about under GDPR, Ricky of Hortonworks got into it a little bit in his presentation, was the notion that if you are using an EU citizen's PII to drive algorithmic outcomes, they have the right to full transparency into the algorithmic decision paths that were taken. I remember IBM had a tool under the Watson brand that wraps up a narrative of that sort. Is that something that IBM still, it was called Watson Curator a few years back, is that a solution that IBM still offers? Because I'm getting a sense right now that Hortonworks has a specific solution, not to say that they may not be working on it, that addresses that side of GDPR. Do you know what I'm referring to there? >> I'm not aware of something from the Hortonworks side beyond the Data Steward Studio, which offers basically identification of what some of the... >> Data lineage as opposed to model lineage. It's a subtle distinction. >> It can identify some of the personal information and maybe provide a way to tag it and, hence, mask it, but the Truata offering is the one that is bringing some new research assets. After the GDPR guidelines became clear, they got into the full depth of how to cater to those requirements. These are relatively new proprietary components; they are not even being productized, which is why I am calling them proprietary components that are going into this hosting service. >> IBM's got a big portfolio, so I'll understand if you guys are still working out what goes where. Rebecca, go ahead. >> I just wanted to ask you about this new era of GDPR.
The last Hortonworks conference was sort of before it came into effect, and now we're in this new era. How would you say companies are reacting? Are they in the right space for it, in the sense that they're really still understanding the ripple effects and how it's all going to play out? How would you describe your interactions with companies in terms of how they're dealing with these new requirements? >> They are still trying to understand the requirements and interpret them, coming to terms with what that really means. For example, I met with a customer and they are a multi-national company. They have data centers across different geos, and they asked me: I have somebody from Asia trying to query the data, so the query should go to Europe, but the query processing should not happen in Asia; the query processing should all happen in Europe, and only the output of the query should be sent back to Asia. You wouldn't have thought in these terms before the GDPR era. >> Right, exceedingly complicated. >> Decoupling storage from processing enables those kinds of fairly complex scenarios for compliance purposes. >> It's not just about the access to data; now you are getting into where the processing happens, where the results are getting displayed, so we are getting... >> Severe penalties for not doing that, so your customers need to keep up. There was an announcement at this show, at DataWorks 2018, of an IBM Hortonworks solution: IBM Hosted Analytics with Hortonworks. I wonder if you could speak a little bit about that, Pandit, in terms of what's provided; is it a subscription service? If you could tell us what subset of IBM's analytics portfolio is hosted for Hortonworks' customers? >> Sure, as you said, it is a hosted offering. Initially we are starting off as a base offering with three products: it will have HDP, Big SQL (IBM Db2 Big SQL), and DSX, Data Science Experience. Those are the three solutions. Again, as I said, it is hosted on IBM Cloud, so customers have a choice of different configurations they can choose, whether it be VMs or bare metal. I should say this is probably the only offering, as of today, that offers a bare metal configuration in the cloud. >> It's geared to data scientists and developers who will build machine-learning models and train them in IBM Cloud, but on a hosted HDP in IBM Cloud. Is that correct? >> Yeah, I would rephrase that a little bit. There are several different offerings on the cloud today, and we can think about them, as you said, for ad-hoc or ephemeral workloads, also geared towards low cost. Think about this offering as taking your on-prem data center experience directly onto the cloud. It is geared towards very high performance. The hardware and the software are all configured and optimized for providing high performance, not necessarily for ad-hoc or ephemeral workloads. They are capable of handling massive, sticky workloads; it's not meant for "I turn on this massive computing power for a couple of hours and then switch it off," but rather, "I'm going to run these massive workloads as if they were located in my data center." That's number one. It comes with the complete set of HDP. If you think about what's currently in the cloud, you have Hive and HBase, the SQL engines, and the storage all separate; security is optional, governance is optional. This comes with the whole enchilada. It has security and governance all baked in.
It provides the option to use Big SQL, because once you get on Hadoop, the next experience is: I want to run complex workloads. I want to run federated queries across Hadoop as well as other data stores. How do I handle those? And then it comes with Data Science Experience, also configured for best performance and integrated together. As a part of this partnership, I mentioned earlier that we have progressed towards providing this story of an end-to-end solution. The next steps of that are: yes, I can say that it's an end-to-end solution, but do the products look and feel as if they are one solution? That's what we are getting into, and I have featured some of those integrations. For example Big SQL, an IBM product: we have been working on integrating it very closely with HDP. It can be deployed through Ambari, and it is integrated with Atlas and Ranger for security. We are improving the integrations with Atlas for governance. >> Say you're building a Spark machine learning model inside DSX on HDP, within IH (mumbles) IBM hosting with Hortonworks on HDP 3.0; can you then containerize that machine learning Spark model and then deploy it into an edge scenario? >> Sure, first was Big SQL, the next one was DSX. DSX is integrated with HDP as well. We could run DSX workloads on HDP before, but what we have done now is this: if I want to run a DSX workload, say a Python workload, I need to have the Python libraries on all the nodes where I want to deploy it. Suppose you are running a big cluster, a 500-node cluster. I need to have the Python libraries on all 500 nodes and I need to maintain the versioning of them. If I upgrade the versions, then I need to go and upgrade and make sure all of them are perfectly aligned. >> In this first version will you be able to build a Spark model and a TensorFlow model and containerize them and deploy them? >> Yes. >> Across a multi-cloud, and orchestrate them with Kubernetes to do all that meshing? Is that a capability now or planned for the future within this portfolio? >> Yeah, we have that capability demonstrated at the booth today, so that is a new integration. We can run what we call a virtual Python environment. DSX can containerize it and run it against data that's stored in the HDP cluster. Now we are making use of both the data in the cluster, as well as the infrastructure of the cluster itself, for running the workloads. >> In terms of the layered stack, is it also incorporating the IBM distributed deep-learning technology that you've recently announced? Which I think is highly differentiated, because deep learning is increasingly becoming a set of capabilities that run across a distributed mesh, playing together as if they're one unified application. Is that a capability now in this solution, or will it be in the near future? DDL, distributed deep learning? >> No, we have not yet. >> I know that's on the AI Power platform currently, gotcha. >> It's what we'll be talking about at next year's conference. >> That's definitely on the roadmap. We are starting with the base bare metal and VM configurations; the next one, depending on how the customers react to it, is definitely, we're thinking about bare metal with GPUs optimized for TensorFlow workloads. >> Exciting. We'll stay tuned in the coming months and years; I'm sure you guys will have that. >> Pandit, thank you so much for coming on theCUBE. We appreciate it. I'm Rebecca Knight for James Kobielus. We will have more from theCUBE's live coverage of DataWorks just after this.
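The workflow Pandit describes, shipping a self-contained Python environment to the cluster and training against data already in HDP rather than copying it out, boils down in practice to a PySpark job like the hedged sketch below. The Hive table, feature columns, and output path are invented for illustration.

```python
# Train a simple model against Hive-resident data: the DSX-on-HDP pattern,
# where the compute runs where the data lives. Table and column names are
# illustrative assumptions (churned is assumed to be a numeric 0/1 label).
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dsx-churn-sketch")
         .enableHiveSupport()      # read tables from the cluster metastore
         .getOrCreate())

df = spark.sql("SELECT tenure, monthly_spend, churned FROM sales.customers")

# Assemble feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["tenure", "monthly_spend"],
                            outputCol="features")
train = assembler.transform(df).withColumnRenamed("churned", "label")

model = LogisticRegression(maxIter=25).fit(train)
model.write().overwrite().save("/models/churn_lr")  # HDFS path, also assumed

spark.stop()
```

Packaging the Python dependencies for a job like this into one containerized virtual environment, instead of installing libraries on all 500 nodes, is exactly the versioning problem described above.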

Published Date : Jun 19 2018


Dan Potter, Attunity & Ali Bajwa, Hortonworks | DataWorks Summit 2018


 

>> Live from San Jose in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2018, brought to you by Hortonworks. >> Welcome back to theCUBE's live coverage of DataWorks here in sunny San Jose, California. I'm your host Rebecca Knight along with my co-host James Kobielus. We're joined by Dan Potter. He is the VP of Product Management at Attunity, and also Ali Bajwa, who is the principal partner solutions engineer at Hortonworks. Thanks so much for coming on theCUBE. >> Pleasure to be here. >> It's good to be here. >> So I want to start with you, Dan, and have you tell our viewers a little bit about the company, based in Boston, Massachusetts, and what Attunity does. >> Attunity, we're a data integration vendor. We are best known as a provider of real-time data movement from transactional systems into data lakes, into clouds, into streaming architectures, so it's a modern approach to data integration. So as these core transactional systems are being updated, we're able to take those changes and move those changes where they're needed, when they're needed, for analytics, for new operational applications, for a variety of different tasks. >> Change data capture. >> Change data capture is the heart of our-- >> They are well known in this business. They have change data capture. Go ahead. >> We are. >> So tell us about the announcement today that Attunity has made at the Hortonworks-- >> Yeah, thank you, it's a great announcement because it showcases the collaboration between Attunity and Hortonworks, and it's all about taking the metadata that we capture in that integration process. So we're a piece of a data lake architecture. As we are capturing changes from those source systems, we are also capturing the metadata, so we understand the source systems, we understand how the data gets modified along the way. We use that metadata internally, and now we've built extensions to share that metadata into Atlas, and to be able to extend that out through Atlas to higher-level data governance initiatives, so Data Steward Studio, into the DataPlane Services. So it's really important to be able to take the metadata that we have and to add to it the metadata that's from the other sources of information. >> Sure, and more of the transactional semantics that Hortonworks has been describing, they've baked into HDP and your overall portfolios. Is that true? I mean, that supports those kinds of requirements. >> With HDP, what we're seeing is, you know, the EDW optimization play has become more and more important for a lot of customers as they try to optimize the data that their EDWs are working on, so it really gels well with what we've done here with Attunity, and then on the Atlas side, with the integration on the governance side, with GDPR and other sorts of regulations coming into play now, you know, those sorts of things are becoming more and more important, specifically around the governance initiative. We actually have a talk just on Thursday morning where we're actually showcasing the integration as well. >> So can you talk a little bit more about that for those who aren't going to be there on Thursday? GDPR was really a big theme at the DataWorks Berlin event, and now we're in this new era and it's not talked about too, too much, I mean we-- >> And global businesses who have operations in the EU, but also all over the world, are trying to be systematic and consistent about how they manage PII everywhere.
So GDPR is an EU regulation, but really, in many ways it's having ripple effects across the world in terms of practices. >> Absolutely, and at the heart of understanding how you protect yourself and comply, I need to understand my data, and that's where metadata comes in. So having a holistic understanding of all of the data that resides in your data lake or in your cloud, metadata becomes a key part of that. And also in terms of enforcing that: if I understand my customer data, where the customer data comes from, the lineage of that, then I'm able to apply the protections, the masking, on top of that data. So really, the GDPR effect has, you know, created a broad-scale need for organizations to really get a handle on metadata, so the timing of our announcement just works real well. >> And one nice thing about this integration is that, you know, it's not just about being able to capture the data in Atlas, but now with the integration of Atlas and Ranger, you can do enforcement of policies based on classifications as well, so if you can tag data as PCI, PII, personal data, that can get enforced through Ranger to say, hey, only certain admins can access certain types of data. And now all that becomes possible once we've taken the initial steps of the Atlas integration. >> So with this collaboration, and it's really deepening an existing relationship, how do you go to market? How do you collaborate with each other and then also service clients? >> You want to? >> Yeah, so from an engineering perspective, we've got deep roots in terms of being a first-class provider into the Hortonworks platform, both HDP and HDF. Last year about this time, we announced our support for ACID merge capabilities, so the leading-edge work that Hortonworks has done in bringing ACID compliance capabilities into Hive was a really important one, and our change data capture capabilities are able to feed directly into that and support those extensions. >> Yeah, we have a lot of, you know, really key customers together with Attunity, and maybe as a result of that they are actually our ISV of the Year as well, which they probably showcase at their booth there. >> We're very proud of that. Yeah, no, it's a nice honor for us to get that distinction from Hortonworks, and it's also a proof point for the collaboration that we have commercially. You know, our sales reps work hand in hand. When we go into a large organization, we both sell to very large organizations. These are big transformative initiatives for these organizations and they're looking for solutions, not technologies, so the fact that we can come in and show the proof points from other customers that are successfully using our joint solution, that's critical. >> And I think it helps that they're integrating with some of our key technologies because, you know, that's what our sales force and our customers really see, as well as where we're putting in the investment, and that's where these guys are also investing, so it really, you know, helps the story together. So with Hive, we're doing a lot of investment in making it closer and closer to a sort of real-time database, where you can combine historical insights as well as your, you know, real-time insights, with the new ACID merge capabilities where you can do the inserts, updates, and deletes, and so that's exactly what Attunity's integrating with, along with Atlas.
We're doing a lot of investment there, and that's exactly what these guys are integrating with. So I think our customers and prospects really see that, and that's where all the wins are coming from. >> Yeah, and I think together there were two main barriers that we saw in terms of customers getting the most out of their data lake investment. One of them was: as I'm moving data into my data lake, I need to be able to put some structure around this, I need to be able to handle continuously updating data from multiple sources, and that's what we introduced with Attunity Compose for Hive, building out the structure in an automated fashion so I've got analytics-ready data, and using the ACID merge capabilities just made those updates much easier. The second piece was metadata. Business users need to have confidence in the data that they're using. Where did this come from? How was it modified? And overcoming both of those is really helping organizations make the most of those investments. >> How would you describe customer attitudes right now in terms of their approach to data? Because, I mean, as we've talked about, data is the new oil, so there's a real excitement and there's a buzz around it, and yet there are also so many high-profile cases of breaches and security concerns. So what would you say: are customers more excited, or are they more trepidatious? How would you describe the CIO mindset right now? >> So I think security and governance have become top of mind, right? More and more, in the surveys that we've taken with our customers, more and more customers are concerned about security, they're more concerned about governance. The joke is that we talk to some of our customers and they keep talking to us about Atlas, which is sort of one of the newer offerings on governance that we have, but then we ask, "Hey, what about Ranger for enforcement?" And they're like, "Oh, yeah, that's a standard now." So we have Ranger, and now it's a question of, you know, how do we get our hooks into Atlas and all that kind of stuff. So yeah, definitely, as you mentioned, because of GDPR, because of all these kinds of issues that have happened, it's definitely become top of mind. >> And I would say the other side of that is there's real excitement as well about the possibilities. Now bringing together all of this data: AI, machine learning, real-time analytics, and real-time visualization. There are analytic capabilities now that organizations have never had, so there's great excitement, but there's also trepidation. You know, how do we solve for both of those? And together, we're doing just that. >> But as you mentioned, if you look at Europe, some of the European companies that are harder hit by GDPR are actually excited that now they can, you know, really get to understand their data more and do better things with it as a result of, you know, the GDPR initiative. >> Absolutely. >> Are you using machine learning inside of Attunity in a Hortonworks context to find patterns in that data in real time? >> So we enable data scientists to build those models.
So we're not only bringing the data together but again, part of the announcement last year is the way we structure that data in Hive, we provide a complete historic data store, so every single transaction that has happened, and we send those transactions as they happen, as a big append, so if you're a data scientist, I want to understand the complete history of the transactions of a customer to be able to build those models, so building those out in Hive and making those analytics-ready in Hive, that's what we do, so we're a key enabler to machine learning. >> Making analytics ready rather than do the analytics in the stream, yeah. >> Absolutely. >> Yeah, the other side to that is that because they're integrated with Atlas, you know, now we have a new capability called DataPlane and Data Steward Studio, so the idea there is around multi-everything, so more and more customers have multiple clusters whether it's on-prem, in the cloud, so now more and more customers are looking at how do I get a single pane of glass view across all my data whether it's on-prem, in the cloud, whether it's IoT, whether it's data at rest, right, so that's where DataPlane comes in, and with the Data Steward Studio, which is our second offering on top of DataPlane, they can kind of get that view across all their clusters, so as soon as you know the data lands from Attunity into Atlas, you can get a view into that as a part of Data Steward Studio, and one of the nice things we do in Data Steward Studio is that we also have machine learning models to do some profiling, to figure out that hey, this looks like a credit card, so maybe I should suggest this as a tag of sensitive data, and now the end user, the end administrator, has the option of you know saying that okay, yeah, this is a credit card, I'll accept that tag, or they can reject that and pick one of their own. >> Will any of this, going forward, of the Attunity CDC change data capture capability be containerized for deployment to the edges in HDP 3.0? I mean, 'cause it seems, I mean for internet of things, edge analytics and so forth, change data capture, is it absolutely necessary to make the entire, some call it the fog computing, cloud or whatever, to make it a completely transactional environment for all applications from micro endpoint to micro endpoint? Are there any plans to do that going forward? >> Yeah, so I think with HDP 3.0, as you mentioned, right, one of the key factors that was coming into play was around time to value, so with containerization now being able to bring third-party apps on top of YARN through Docker, I think that's definitely an avenue that we're looking at. >> Yes, we're excited about that with 3.0 as well, so that's definitely in the cards for us. >> Great, well, Ali and Dan, thank you so much for coming on theCUBE. It's fun to have you here. >> Nice to be here, thank you guys. >> Great to have you. >> Thank you, it was a pleasure. >> I'm Rebecca Knight, for James Kobielus, we will have more from DataWorks in San Jose just after this. (techno music)
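The profiling idea mentioned above, scanning sampled values and suggesting a sensitive-data tag when something looks like a credit card, can be approximated with a pattern match plus a Luhn checksum. This is a toy stand-in for that suggestion step, not Data Steward Studio's actual models:

```python
import re

CARD_RE = re.compile(r"^\d{13,19}$")

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum used to validate card-like numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def suggest_tag(sample_values, threshold=0.8):
    """Suggest a 'PCI' tag if most sampled values look like card numbers;
    a human steward still accepts or rejects the suggestion."""
    cleaned = [re.sub(r"[ -]", "", v) for v in sample_values]
    hits = sum(1 for v in cleaned if CARD_RE.match(v) and luhn_ok(v))
    return "PCI" if cleaned and hits / len(cleaned) >= threshold else None
```

The human-in-the-loop step is the important part: the model only proposes the tag, and the steward's accept or reject decision is what drives enforcement.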

Published Date : Jun 19 2018


Arun Murthy, Hortonworks | DataWorks Summit 2018


 

>> Live from San Jose in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2018, brought to you by Hortonworks. >> Welcome back to theCUBE's live coverage of DataWorks here in San Jose, California. I'm your host, Rebecca Knight, along with my cohost, Jim Kobielus. We're joined by Aaron Murphy, Arun Murthy, sorry. He is the co-founder and chief product officer of Hortonworks. Thank you so much for returning to theCUBE. It's great to have you on. >> Yeah, likewise. It's been a fun time getting back, yeah. >> So you were on the main stage this morning in the keynote, and you were describing the journey, the data journey that so many customers are on right now, and you were talking about the cloud, saying that the cloud is part of the strategy but it really needs to fit into the overall business strategy. Can you describe a little bit about your approach to that? >> Absolutely, and the way we look at this is we help customers leverage data to actually deliver better capabilities, better services, better experiences, to their customers, and that's the business we are in. Now with that obviously we look at cloud as a really key part of it, of the overall strategy in terms of how you want to manage data on-prem and on the cloud. We kind of joke that we ourselves live in a world of real-time data. We just live in it and data is everywhere. You might have trucks on the road, you might have drones, you might have sensors and you have it all over the world. At that point, we've kind of got to a point where enterprises understand that they'll manage all the infrastructure, but in a lot of cases, it will make a lot more sense to actually lease some of it, and that's the cloud. It's the same way, if you're delivering packages, you don't go buy planes and lay out roads, you go to FedEx and actually let them handle that for you. That's kind of what the cloud is. So that is why we really fundamentally believe that we have to help customers leverage infrastructure whatever makes sense pragmatically, both from an architectural standpoint and from a financial standpoint, and that's kind of why we talked about how your cloud strategy is part of your data strategy, which is actually fundamentally part of your business strategy. >> So how are you helping customers to leverage this? What is on their minds and what's your response? >> Yeah, it's really interesting, like I said, cloud is cloud, and infrastructure management is certainly something that's at the foremost, at the top of the mind for every CIO today. And what we've consistently heard is they need a way to manage all this data and all this infrastructure in a hybrid multi-tenant, multi-cloud fashion. Because in some geos you might not have your favorite cloud provider. You know, going to parts of Asia is a great example. You might have to use one of the Chinese clouds. You go to parts of Europe, especially with things like the GDPR, the data residency laws and so on, you have to be very, very cognizant of where your data gets stored and where your infrastructure is present. And that is why we fundamentally believe it's really important to have and give enterprises a fabric with which they can manage all of this. And hide the details of all of the underlying infrastructure from them as much as possible. >> And that's DataPlane Services. >> And that's DataPlane Services, exactly. >> The Hortonworks DataPlane Services we launched in October of last year. Actually I was on theCUBE talking about it back then too.
We see a lot of interest, a lot of excitement around it because now they understand that, again, this doesn't mean that we drive it down to the least common denominator. It is about helping enterprises leverage the key differentiators of each of the cloud providers' products. For example, Google, with which we announced a partnership, they are really strong on AI and ML. So if you are running TensorFlow and you want to deal with things like Kubernetes, GKE is a great place to do it. And, for example, you can now go to Google Cloud and get TPUs, which work great for TensorFlow. Similarly, a lot of customers run on Amazon for a bunch of the operational stuff, Redshift as an example. So the world we live in, we want to help the CIO leverage the best piece of the cloud but then give them a consistent way to manage and govern that data. We were joking on stage that IT has just about learned how to deal with Kerberos and Hadoop. And now we're telling them, "Oh, go figure out IAM on Google," which is also IAM on Amazon, but they are completely different. The only thing that's consistent is the name. So I think we have a unique opportunity, especially with the open source technologies like Atlas, Ranger, Knox and so on, to be able to draw a consistent fabric over this for security and governance. And help the enterprise leverage the best parts of the cloud to put a best-fit architecture together, but which also happens to be a best-of-breed architecture. >> So the fabric is everything you're describing, all the Apache open source projects in which Hortonworks is a primary committer and contributor, are able to share schemas and policies and metadata and so forth across this distributed heterogeneous fabric of public and private cloud segments within a distributed environment. >> Exactly. >> That's increasingly being containerized in terms of the applications for deployment to edge nodes. Containerization is a big theme in HDP 3.0, which you announced at this show. >> Yeah. >> So, if you could give us a quick sense for how that containerization capability plays into more of an edge focus for what your customers are doing. >> Exactly, great point, and again, the fabric is obviously, the core parts of the fabric are the open source projects, but we've also done a lot of net new innovation with DataPlane which, by the way, is also open source. It's a new product and a new platform that you can actually leverage, to lay it out over the open source ones you're familiar with. And again, like you said, containerization, what is actually driving the fundamentals of this, the details matter, the scale at which we operate, we're talking about thousands of nodes, terabytes of data. The details really matter because a 5% improvement at that scale leads to millions of dollars in optimization for capex and opex. So that's why all of that, the details are being fueled and driven by the community, which is kind of what we deliver with HDP3. And the key ones, like you said, are containerization, because now we can actually get complete agility in terms of how you deploy the applications. You get isolation not only at the resource management level with containers but you also get it at the software level, which means, if two data scientists wanted to use a different version of Python or Scala or Spark or whatever it is, they get that consistently and holistically. That now they can actually go from the test dev cycle into production in a completely consistent manner.
So that's why containers are so big, because now we can actually leverage it across the stack and the things like MiNiFi showing up. We can actually-- >> Define MiNiFi before you go further. What is MiNiFi for our listeners? >> Great question. Yeah, so we've always had NiFi-- >> Real-time >> Real-time data flow management, and NiFi was still sort of within the data center. What MiNiFi does is actually now a really, really small layer, a small thin library if you will, that you can throw on a phone, a doorbell, a sensor, and that gives you all the capabilities of NiFi but at the edge. >> Mmm. >> Right? And it's actually not just data flow, but what is really cool about NiFi is it's actually command and control. So you can actually do bidirectional command and control, so you can actually change in real-time the flows you want, the processing you do, and so on. So what we're trying to do with MiNiFi is actually not just collect data from the edge but also push the processing as much as possible to the edge, because we really do believe a lot more processing is going to happen at the edge, especially with the ASICs and so on coming out. There will be custom hardware that you can throw out there and essentially leverage that hardware at the edge to actually do this processing. And we believe, you know, we want to do that even at the cost of the data not actually landing at rest, because at the end of the day we're in the insights business, not in the data storage business. >> Well I want to get back to that. You were talking about innovation and how so much of it is driven by the open source community, and you're a veteran of the big data open source community. How do we maintain that? How does that continue to be the fuel? >> Yeah, and a lot of it starts with just being consistent. From day one, James was around back then, in 2011 we started, we've always said, "We're going to be open source," because we fundamentally believed that the community is going to out-innovate any one vendor regardless of how much money they have in the bank. So we really do believe that's the best way to innovate, mostly because there is a sense of shared ownership of that product. It's not just one vendor throwing some code out there trying to shove it down the customers' throats. And we've seen this over and over again, right. Three years ago, we talked about, a lot of the data plane stuff comes from Atlas and Ranger and so on. None of these existed. These actually came from the fruits of the collaboration with the community, with actually some very large enterprises being a part of it. So it's a great example of how we continue to drive it, because we fundamentally believe that, that's the best way to innovate, and continue to believe so. >> Right. And the community, the Apache community as a whole, so many different projects that, for example, in streaming, there is Kafka, >> Okay. >> and there is others that address a core set of common requirements but in different ways, >> Exactly. >> supporting different approaches, for example, they are doing streaming with stateless transactions and so forth, or stateless semantics and so forth. Seems to me that Hortonworks is shifting towards being more of a streaming-oriented vendor away from data at rest. Though, I should say HDP 3.0 has got great scalability and storage efficiency capabilities baked in.
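The MiNiFi behavior described above, dialing shipped telemetry up when something looks wrong, is a command-and-control decision made at the device. Here is a simplified stand-in for that edge logic in plain Python (MiNiFi itself is configured with flow definitions, not code like this):

```python
def select_records(readings, temp_limit=900.0):
    """Ship a sparse sample normally; once a reading crosses the limit
    (the 'engine heating up' case), emit a warning and ship everything."""
    alerted = False
    for i, r in enumerate(readings):
        if r["temp_c"] > temp_limit:
            alerted = True
            yield {"type": "warning", **r}
        elif alerted or i % 100 == 0:
            yield {"type": "sample", **r}

# The baseline sample goes upstream, plus everything from the hot reading on.
readings = [{"temp_c": t} for t in (450.0, 460.0, 910.0, 905.0)]
print(list(select_records(readings)))
```

The bidirectional part is that the threshold and the sampling rate are themselves parameters the data center can push down to the edge at runtime.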
I wonder if you could just break it down a little bit, what the innovations or enhancements are in HDP 3.0 for those of your core customers, which is most of them, who are managing massive multi-terabyte, multi-petabyte distributed, federated, big data lakes. What's in HDP 3.0 for them? >> Oh, lots. Again, like I said, we obviously spend a lot of time on the streaming side because that's where we see. We live in a real-time world. But again, we don't do it at the cost of our core business, which continues to be HDP. And as you can see, the community continues to drive it; we talked about containerization, a massive step up for the Hadoop community. We've also added support for GPUs. Again, if you think about doing at-scale machine learning. >> Graphics processing units, >> Graphical-- >> AI, deep learning >> Yeah, it's huge. Deep learning, TensorFlow and so on, really, really need a custom, sort of GPU, if you will. So that's coming. That's in HDP3. We've added a whole bunch of scalability improvements with HDFS. We've added federation because now we can go from, you can go over a billion files, a billion objects in HDFS. We also added capabilities for-- >> But you indicated yesterday when we were talking that very few of your customers need that capacity yet, but you think they will, so-- >> Oh for sure. Again, part of this is as we enable more sources of data in real-time, that's the fuel which drives it, and that was always the strategy behind the HDF product. It was about, can we leverage the synergies between the real-time world, feed that into what you do today, in your classic enterprise with data at rest, and that is what is driving the necessity for scale. >> Yes. >> Right. We've done that. We spent a lot of work, again, lowering the total cost of ownership, the TCO, so we added erasure coding. >> What is that exactly? >> Yeah, so erasure coding is a classic sort of storage concept, which allows you to, in sort of, you know, HDFS has always been three replicas, for redundancy, fault tolerance and recovery. Now, it sounds okay having three replicas because it's cheap disk, right. But when you start to think about our customers running 70, 80, a hundred terabytes of data, those three replicas add up, because you've now gone from 80 terabytes of effective data to actually a quarter of a petabyte in terms of raw storage. So now what we can do with erasure coding is actually, instead of storing the three blocks, we actually store parity. We store the encoding of it, which means we can actually go down from three to like two, one and a half, whatever we want to do. So, if we can get from three blocks to one and a half, especially for your core data, >> Yeah >> the ones you're not accessing every day, it results in a massive savings in terms of your infrastructure costs. And that's kind of what we're in the business of doing, helping customers do better with the data they have, whether it's on-prem or on the cloud; that's sort of, we want to help customers be comfortable getting more data under management, along with security and the lower TCO. The other sort of big piece I'm really excited about in HDP3 is all the work that's happened in the Hive community for what we call the real-time database. >> Yes. >> As you guys know, you follow the whole SQL-on-Hadoop space. >> And Hive has changed a lot in the last several years; this is very different from what it was five years ago.
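The storage arithmetic here is worth making explicit: 80 TB of effective data under three-way replication occupies 240 TB raw (roughly the quarter of a petabyte cited), while a Reed-Solomon layout such as the common 6-data/3-parity scheme stores 1.5 bytes per byte, the "three down to one and a half" just mentioned. A quick worked comparison:

```python
def raw_storage(effective_tb, overhead):
    """Raw capacity needed for a given storage overhead factor."""
    return effective_tb * overhead

effective = 80.0                                  # TB of actual data
replicated = raw_storage(effective, 3.0)          # 3 full replicas
erasure = raw_storage(effective, (6 + 3) / 6)     # RS(6,3): 1.5x overhead

print(f"3x replication : {replicated:.0f} TB raw")  # 240 TB (~1/4 PB)
print(f"RS(6,3) coding : {erasure:.0f} TB raw")     # 120 TB
```

Halving the raw footprint on cold data is exactly why the feature is aimed at "the ones you're not accessing every day."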
>> The only thing that's same from five years ago is the name. (laughing) >> So again, the community has done a phenomenal job, kind of, really taking sort of a, we used to call it like a SQL engine on HDFS. From there, to drive it with 3.0, it's now like, with Hive 3, which is part of HDP3, it's a full-fledged database. It's got full ACID support. In fact, the ACID support is so good that writing ACID tables is at least as fast as writing non-ACID tables now. And you can do that not only on-- >> Transactional database. >> Exactly. Now not only can you do it on-prem, you can do it on S3. So you can actually drive the transactions through Hive on S3. We've done a lot of work to actually, you were there yesterday when we were talking about some of the performance work we've done with LLAP and so on, to actually give consistent performance both on-prem and in the cloud, and this is a lot of effort simply because the performance characteristics you get from the storage layer with HDFS versus S3 are significantly different. So now we have been able to bridge those with things like LLAP. We've done a lot of work and sort of enhanced the security model around it, governance and security. So now you get things like column-level masking, row-level filtering, all the standard stuff that you would expect and more from an enterprise data warehouse. We talked to a lot of our customers, they're doing, literally, tens of thousands of views because they don't have the capabilities that exist in Hive now. >> Mmm-hmm. >> And I'm sitting here kind of being amazed that for an open source set of tools to have the best security and governance at this point is pretty amazing, coming from where we started off. >> And it's absolutely essential for GDPR compliance, and compliance with HIPAA and every other mandate and sensitivity that requires you to protect personally identifiable information, so very important. So in many ways Hortonworks has one of the premier big data catalogs for all manner of compliance requirements that your customers are chasing. >> Yeah, and James, you wrote about it in the context of Data Steward Studio, which we introduced >> Yes. >> You know, things like consent management, having--- >> A consent portal >> A consent portal >> In which the customer can indicate the degree to which >> Exactly. >> they require controls over their management of their PII, possibly to be forgotten and so forth. >> Yeah, the right to be forgotten, and consent even for analytics. Within the context of GDPR, you have to allow the customer to opt out of analytics, them being part of an analytic itself, right. >> Yeah. >> So things like those are now something we enable through the enhanced security models that are done in Ranger. So now, it's sort of, the really cool part of what we've done now with GDPR is that we can get all these capabilities on existing data and existing applications by just adding a security policy, not rewriting. It's a massive, massive, massive deal, which I cannot tell you how much customers are excited about, because they now understand. They were sort of freaking out that I have to go to 30, 40, 50 thousand enterprise apps and change them to take advantage, to actually provide consent, and the right to be forgotten. The fact that you can do that now by changing a security policy with Ranger is huge for them. >> Arun, thank you so much for coming on theCUBE. It's always so much fun talking to you. >> Likewise. Thank you so much. >> I learned something every time I listen to you. >> Indeed, indeed.
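To make the compliance point concrete: once the table is fully ACID, a right-to-be-forgotten request becomes a row-level operation instead of an application rewrite. A minimal sketch, assuming a hypothetical customers table and a DB-API style Hive connection (such as PyHive) whose parameter style is pyformat:

```python
def forget_customer(cursor, customer_id: int):
    """Honor a GDPR erasure request with a plain row-level DELETE,
    which Hive 3 ACID tables support directly."""
    cursor.execute(
        "DELETE FROM customers WHERE customer_id = %s", (customer_id,)
    )
```

Consent handling for users who remain opted in would then be layered on through Ranger policies (masking, row-level filters) rather than through changes to the applications themselves, which is the point being made above.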
I'm Rebecca Knight for James Kobielus, we will have more from theCUBE's live coverage of DataWorks just after this. (Techno music)

Published Date : Jun 19 2018


Alan Gates, Hortonworks | DataWorks Summit 2018


 

(techno music) >> (announcer) From Berlin, Germany, it's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks. >> Well hello, welcome to theCUBE. We're here on day two of DataWorks Summit 2018 in Berlin, Germany. I'm James Kobielus. I'm lead analyst for Big Data Analytics in the Wikibon team of SiliconANGLE Media. And who we have here today, we have Alan Gates, who's one of the founders of Hortonworks, and Hortonworks of course is the host of DataWorks Summit and he's going to be, well, hello Alan. Welcome to theCUBE. >> Hello, thank you. >> Yeah, so Alan, so you and I go way back. Essentially, what we'd like you to do first of all is just explain a little bit of the genesis of Hortonworks. Where it came from, your role as a founder from the beginning, how that's evolved over time, but really how the company has evolved specifically with the focus on the community, the Hadoop community, the Open Source community. You have a deepening open source stack that you build upon with Atlas and Ranger and so forth. Give us a sense for all of that, Alan. >> Sure. So as I think it's well-known, we started as the team at Yahoo that really was driving a lot of the development of Hadoop. We were one of the major players in the Hadoop community. Worked on that for, I was in that team for four years. I think the team itself was going for about five. And it became clear that there was an opportunity to build a business around this. Some others had already started to do so. We wanted to participate in that. We worked with Yahoo to spin out Hortonworks, and actually they were a great partner in that. Helped us get that spun out. And the leadership team of the Hadoop team at Yahoo became the founders of Hortonworks, and brought along a number of the other engineering, a bunch of the other engineers to help get started. And really at the beginning, we were, it was Hadoop, Pig, Hive, you know, a few of the very, HBase, the kind of, the beginning projects. So pretty small toolkit. And we were, our early customers were very engineering-heavy people, or companies who knew how to take those tools and build something directly on those tools, right?
So that does drive us to one, put more things into the toolkit, so you see the addition of projects like Apache Atlas and Ranger for security and all that. Another area of growth, I would say, is also the kind of data that we're focused on. So early on, we were focused on data at rest. You know, we're going to store all this stuff in HDFS, and as the kind of data scene has evolved, there's a lot more focus now on a couple things. One is data, what we call data-in-motion, for our HDF product, where you've got a stream manager like Kafka or something like that >> (James) Right >> So there's processing that kind of data. But now we also see a lot of data in various places. It's not just oh, okay, I have a Hadoop cluster on premise at my company. I might have some here, some on premise somewhere else, and I might have it in several clouds as well. >> Okay, your focus has shifted like the industry in general towards streaming data in multi-clouds where your, it's more stateful interactions and so forth? I think you've made investments in Apache NiFi so >> (Alan) yes. >> Give us a sense for your NiFi versus Kafka and so forth inside of your product strategy or your >> Sure. So NiFi is really focused on that data at the edge, right? So you're bringing data in from sensors, connected cars, airplane engines, all those sorts of things that are out there generating data and you need, you need to figure out what parts of the data to move upstream, what parts not to. What processing can I do here so that I don't have to move upstream? When I have an error event or a warning event, can I turn up the amount of data I'm sending in, right? Say this airplane engine is suddenly heating up maybe a little more than it's supposed to. Maybe I should ship more of the logs upstream when the plane lands and connects than I would otherwise. That's the kind o' thing that Apache NiFi focuses on. I'm not saying it runs in all those places, but my point is, it's that kind o' edge processing. Kafka is still going to be running in a data center somewhere. It's still a pretty heavyweight technology in terms of memory and disk space and all that, so it's not going to be run on some sensor somewhere. But it is that data-in-motion, right? I've got millions of events streaming through a set of Kafka topics, watching all that sensor data that's coming in from NiFi and reacting to it, maybe putting some of it in the data warehouse for later analysis, all those sorts of things. So that's kind o' the differentiation there between Kafka and NiFi. >> Right, right, right. So, going forward, do you see more of your customers working internet of things projects, is that, we don't often, at least in the industry of popular mind, associate Hortonworks with edge computing and so forth. Is that? >> I think that we will have more and more customers in that space. I mean, our goal is to help our customers with their data wherever it is. >> (James) Yeah. >> When it's on the edge, when it's in the data center, when it's moving in between, when it's in the cloud. All those places, that's where we want to help our customers store and process their data. Right? So, I wouldn't want to say that we're going to focus on just the edge or the internet of things, but that certainly has to be part of our strategy 'cause it has to be part of what our customers are doing. >> When I think about the Hortonworks community, now we have to broaden our understanding because you have a tight partnership with IBM, which obviously is well-established, huge and global.
Give us a sense for, as you guys have teamed more closely with IBM, how your community has changed or broadened or shifted in its focus, or has it? >> I don't know that it's shifted the focus. I mean IBM was already part of the Hadoop community. They were already contributing. Obviously, they've contributed very heavily on projects like Spark and some of those. They continue some of that contribution. So I wouldn't say that it's shifted it, it's just we are working more closely together as we both contribute to those communities, working more closely together to present solutions to our mutual customer base. But I wouldn't say it's really shifted the focus for us. >> Right, right. Now at this show, we're in Europe right now, but it doesn't matter that we're in Europe. GDPR is coming down fast and furious now. Data Steward Studio, we had the demonstration today, it was announced yesterday. And it looks like a really good tool for the main requirements for compliance, which is discover and inventory your data, which is really set up a consent portal, as I like to refer to it. So the data subject can then go and make a request to have my data forgotten and so forth. Give us a sense going forward, for how or if Hortonworks, IBM, and others in your community are going to work towards greater standardization in the functional capabilities of the tools and platforms for enabling GDPR compliance. 'Cause it seems to me that you're going to need, the industry's going to need to have some reference architecture for these kind o' capabilities so that going forward, either your ecosystem of partners can build add-on tools in some common, like the framework that was laid out today looks like a good basis. Is there anything that you're doing in terms of pushing towards more Open Source standardization in that area? >> Yes, there is. So actually one of my responsibilities is the technical management of our relationship with ODPi, which >> (James) yes. >> Mandy Chessell referenced yesterday in her keynote, and that is where we're working with IBM, with ING, with other companies to build exactly those standards. Right? Because we do want to build it around Apache Atlas. We feel like that's a good tool for the basis of that, but we know, one, that some people are going to want to bring their own tools to it. They're not necessarily going to want to use that one platform, so we want to do it in an open way that they can still plug in their metadata repositories and communicate with others, and we want to build the standards on top of that of how do you properly implement these features that GDPR requires, like right to be forgotten, like, you know, what are the protocols around PII data? How do you prevent a breach? How do you respond to a breach?
And so that's where ODPi really comes in, is that group of compliance people that speak a completely different language. But we still need to get them all talking to each other, as you said, so that there's specifications around: How do we do this? And what is compliance? >> Well Alan, thank you very much. We're at the end of our time for this segment. This has been great. It's been great to catch up with you, and Hortonworks has been evolving very rapidly, and it seems to me that, going forward, I think you're well-positioned now for the new GDPR age to take your overall solution portfolio, your partnerships, and your capabilities to the next level, and really in terms of an Open Source framework. In many ways though, you're not entirely, 100%, like nobody is, purely Open Source. You're still very much focused on open frameworks for building fairly scalable, very scalable solutions for enterprise deployment. Well, this has been Jim Kobielus with Alan Gates of Hortonworks here on theCUBE at DataWorks Summit 2018 in Berlin. We'll be back fairly quickly with another guest, and thank you very much for watching our segment. (techno music)

Published Date : Apr 19 2018


Greg Fee, Lyft | Flink Forward 2018


 

>> Narrator: Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. >> This is George Gilbert. We are at Data Artisans' conference Flink Forward. It is for the Apache Flink community, sponsored by Data Artisans, and all the work they're doing to move Flink forward, and to surround it with additional value that makes building stream-processing applications accessible to mainstream companies. Right now though, we are not talking to a mainstream company, we're talking to Greg Fee from Lyft. Not Uber. (laughs) And Greg, tell us a little bit about what you're doing with Flink. What's the first use case that comes to mind that really exercises its capabilities? >> Sure, yeah, so the process of adopting Flink at Lyft has really started with a use case, which was, we're trying to make machine learning more accessible across all of Lyft. So we already use machine learning in quite a few applications, but we want to make sure that we use machine learning as much as possible, we really think that's the path forward. And one of the fundamental difficulties with that is having consistent feature generation between these offline batch-y training scenarios and the online real-time streaming scenarios. And the unified processing engine of Flink really helps us bridge that gap, so. >> When you say unified processing engine, are you saying that the fact that you can manage code and data, as sort of an application version, and some of the, either code or data, is part of the model, and so you're versioning? >> That's even a step beyond what I'm talking about. >> Okay. >> Just the basic fundamental ability to have one piece of business logic that you can apply at the batch bulk layer, and in the real-time layer. >> George: Yeah. >> So that's sort of like the core of what Flink gives you. >> Are you running both batch and streaming on Flink? >> Yes, that's right. >> And using the, so, you're using the windows? Or just periodic execution on a stream to simulate batch? >> That's right. So we have, so feature generation crosses a broad spectrum of possible use cases in Flink. >> George: Yeah. >> And this is where we sort of transition more into what dA platform could give for us. So, we're looking to have thousands of different features across all of our machine learning models. So having a platform that can help us host many of these little programs running, help with the application life-cycle of each of these features, as we version them over time. So, we're very excited about what dA platform can do for us. >> Can you tell us a little more about how the stream processing helps you with the feature selection engineering, and is it that you're using streaming, or simulated batch, or batch using the same programming model to train these models, and you're using, you're picking up different derived data, is that how it's working? >> So, typical life-cycle is, it's going to be a feature engineering stage, so the data scientist is looking at their data, they're trying to figure out patterns in the data, and they're going to, how you apply Flink there, is as you come up with potential algorithms for how you generate your feature, can run that through Flink, generate some data, apply machine learning model on top of it, and sort of play around with that data, prototype things. >> So, what you're doing is offline, or out of the platform, you're doing the feature selection and the engineering. >> Man: Right.
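The "one piece of business logic at the batch bulk layer and in the real-time layer" idea is easiest to see with the logic factored out of the engine entirely. In the sketch below the feature definition is invented and the Flink wiring is elided; the point is only that the same function can be driven by a bounded replay of history (batch) or by live events (stream):

```python
def rides_per_hour(events):
    """Feature logic written once: running count of ride events per
    (driver, hour). `events` may be a finite list or a live generator."""
    counts = {}
    for e in events:
        key = (e["driver_id"], e["ts"] // 3600)
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]

# Batch: replay stored history through the function to train a model ...
history = [{"driver_id": 7, "ts": t} for t in (100, 200, 4000)]
print(list(rides_per_hour(history)))
# ... Stream: feed live events through the identical logic for serving.
live = iter([{"driver_id": 7, "ts": 4100}])
print(list(rides_per_hour(live)))
```

Keeping training-time and serving-time features identical is what removes the consistency gap being described.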
Then you attach a stream to it that has just the relevant, perhaps, the relevant features. >> Man: Right. >> And then that model gets sort of, well maybe not yet, but eventually versioned as part of the application, which includes the application, the rest of the application logic and the data. >> Right. So, like some of the stuff that was touched on this morning at the keynotes, the versioning and maintaining machine learning applications, is a much, is a very complex ecosystem there. So being able to say, okay, going from the prototype stage, doing stuff in batch, to doing stuff in production, and real-time, then being able to version those over time, to move to better and better versions of the feature generation, is very important to us. >> I don't know if this is the most politically correct thing, but you just explained it better than everyone else we have talked to. >> Great. (laughs) >> About how it all fits together with the machine learning. So, once you've got that in place, it sounds like you're using the dA platform, as well as, you know, perhaps some extensions for machine learning, to sort of add that as a separate life-cycle, besides the application code. Then, is that going to be the enterprise-wide platform for deploying, developing and deploying, machine learning applications? >> Yes, certainly we think there's probably a broad ecosystem to do machine learning. It's a very, sort of, wide open area. Certainly my agenda is to push it across the company and get as many things running in this system as possible. I think the real-time aspects of it, a unifying aspect, of what Flink can give us, and the platform can give us, in terms of the life-cycles. >> So, are you set up essentially like where you're the, a shared resource, a shared service, which is the platform group? >> Man: Right. >> And then, all the business units, adopt that platform and build their apps on it. >> Right. So my initiative is part of a greater data science platform at Lyft, so, my goal is to have, we have hundreds of data scientists who are going to be looking at this data, giving me little features that they want to do, and we're probably going to end up numbering in the thousands of features, being able to generate all those, maintain all those little programs. >> And when you say generate all those little programs, that's the application logic, and the models specific to that application? >> That's right, well. >> Or is it this? >> There's features that are typically shared across many models. >> Okay. >> So there's like two layers of things happening. >> So you're managing features separately from the models. >> That's right. >> Interesting. Okay, haven't heard that. And is the application manager tooling going to help address that, or is that custom stuff that you have to do? >> So, I think there's, I think there's a potential that that's the way we're going to manage the model stuff as well, but it's still a little new over there. >> That you put it on the application platform? >> Right. >> Then that's sort of at the boundary of what you're doing right now, or what you will be doing shortly. >> Right. It's all, it's a matter of use-case, whether it's online or offline, and how it fits best in with the rest of the Lyft engineering system. >> When you're talking about your application landscape, do you have lots of streaming applications that feed other streaming applications, going through a hub?
Or, are they sort of more discrete, you know, artifacts, discrete programs, and then when do you keep, stay within the streaming processors, and when do you have it in a shared database? >> That's a, that's a lot of questions, kind of a deep question. So, the goal is to have a central hub, where sort of all of our event data passes through it, and that allows us to decouple. >> So that's, to be careful, that's not a database central hub, that's a, like a? >> An event hub. >> Event hub. >> Right. >> Yeah, okay. >> So, an event hub in the middle allows us to decompose the different, sort of smaller programs, which again are probably going to number in the thousands, so that being able to have different parts of the company maintain their own part of the overall system is very important to us. I think we'll probably see Flink as a major player, in terms of how those programs run, but we'll probably be shooting things off to other systems like Druid, like Hive, like Presto, like Elasticsearch. >> As derived data? >> As all derived data, from these Flink jobs. And then also, pushing data directly out into some of our production systems to feed into machine learning decisions. >> Okay, this is quite, sounds like the most ambitious infrastructure that we've heard, in that it sounds pretty ubiquitous. >> We want to be a machine-learning first company. So, it's everywhere. >> So, now help me clarify for me, when? Because this is, you know, for mainstream companies who've programmed with, you know, DBMS, as a shared state manager for decades, help explain to them when you would still use a DBMS for shared state, and when you would start using the distributed state that's embedded in Flink, and the derived data, you know, at the endpoints, at the sinks. >> So I mean, I guess this kind of gets into your exact, your use cases and, you know, your opinions and thoughts about how to use these things best, but. >> George: Your opinion is what we're interested in. >> Right. From where I'm coming, I see basically databases as one potential sink for this data. They do things very well, right? They do structured queries very well. You can have indices built off that, aggregates, really feed into a lot of visualization stuff. >> George: Yeah. >> But, from where I am sitting, like we're really moving away from databases as something that feeds production data. We've got other stores to do that, that are sort of more tailored towards those scenarios. >> When you say to feed production data, this is transaction capture, or data capture. >> Right. So we don't have a lot of atomic transactions, outside the payments at Lyft; most of the stuff is eventually consistent. So we have stores, more like Dynamo or Cassandra or HBase, that feed a lot of our production data. >> And those databases, are they for like ambient information, like influencing an interaction, it doesn't sound like automating a transaction. It would be, it sounds like, context that helps with analytics, but very separate from the OLTP apps. >> That's right. So we have, you can kind of bifurcate the company into the data that's used in production to make decisions that are like facing the user, and then our analytics back end, that really helps business analysts and like the executives make decisions about how we proceed. >> And so that second part, that backend, is more like operational efficiency. >> Man: Right.
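The central event hub pattern being described usually means every producer writes to shared Kafka topics and each downstream system (Flink jobs, Druid and Elasticsearch loaders, warehouse feeds) consumes independently with its own consumer group. A bare-bones sketch with the kafka-python client; brokers and topic names are placeholders:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["kafka-1:9092"]   # placeholder cluster address
TOPIC = "ride-events"        # shared hub topic

# Any service publishes its events to the hub ...
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"driver_id": 7, "event": "ride_completed"})
producer.flush()

# ... and each downstream sink consumes with its own group, so adding a
# new consumer (say, an Elasticsearch loader) never touches producers.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="es-loader",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)
    break
```

That independence between producers and consumers is the decoupling being referred to: thousands of small programs can share one event stream without coordinating with each other.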
>> And coding new business processes to support new ways of doing business, but the customer-facing stuff specifically like with payments, that still needs a traditional OLTP. >> Man: Right. >> But they're not, those use cases aren't growing that much. >> That's right. So, basically we have very specific use-cases for like a traditional database, but in terms of capturing the types of scale, and the type of growth, we're looking for at Lyft, we think some of the other storage engines suit those better. >> So in that use-case, would the OLTP DBMS be at the front end, would it be a source, or a sink? It sounds like it's a source. >> So we actually do it both ways. Right, so, it's great to get our transactional data flowing through our streaming system, there's a lot of value in that, but also then pushing some of the aggregate results back out to a DBMS helps with our analytics pipeline. >> Okay, okay. Well this is actually really interesting. So, where do you see the dA platform helping, you know, going forward; is it something you don't really need because you've built all that scaffolding to help with sort of application life-cycle management, or do you see it as something that'll help sort of push Flink sort of enterprise-wide? >> I think the dA platform really helps people sort of adopt Flink at an enterprise level. Maintaining the applications is a core part of what it means to run it as a business. And so we're looking at dA platform as a way of managing our applications, and I think, like I'm just talking about one, I'm mostly talking about one application we have for Flink at Lyft. >> Yeah. >> We have many other Flink programs actually running, that are sort of unrelated to my project. >> What about managing non-Flink applications? Do you need an application manager? Is it okay that it's associated with one service, or platform like Flink, or is there a desire you know among bleeding-edge customers to have an overall, sort of infrastructure management, application management kind of suite. >> Yes, for sure. You're touching on something I have started to push inside of Lyft, which is the need for an overall application life-cycle management product that's not technology-specific. >> Would these sort of plug into the dA platform and whatever the Confluent, you know, equivalent is, or is it going to directly tie to the, you know, operational capabilities, or the functional capabilities, not the management capabilities. In other words would it plug into like core Flink, core Kafka, core Spark, that sort of stuff? >> I think that's sort of largely to be determined. If you go back to sort of how distributed systems design works, typically. We have a user plane, which is going to be our data users. Then you end up with the thing we're probably most familiar with, which is our data plane, technologies like Flink and Kafka and Hive, all those guys. What's missing in the middle right now is a control plane. It's a map from the user desire, from the user intention, to what we do with all of that data plane stuff. So launch a new program, maybe you need a new Kafka topic, maybe you need to provision in Kafka. Then, you need to get some Flink programs running, and whether that directly talks to Flink, and goes against Kubernetes, or something like that, or whether it talks to a higher level, like more application-specific platform. >> Man: Yeah.
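The missing control plane can be pictured as a translation layer from user intent to data-plane actions. Here is a deliberately tiny sketch of that mapping, with every action name invented for illustration:

```python
# Map one user intention ("launch this feature as a streaming job") onto
# the data-plane steps mentioned above: topic creation, provisioning,
# and job submission.
def plan_actions(intent):
    actions = []
    if intent.get("needs_topic"):
        actions.append(f"kafka: create topic {intent['name']}-events")
        actions.append("kafka: provision partitions and quota")
    actions.append(f"flink: submit job {intent['name']}")
    return actions

print(plan_actions({"name": "rides-per-hour", "needs_topic": True}))
```

Whether those planned actions are then issued straight to Kubernetes or through a higher-level platform like the dA platform is exactly the open design question raised here.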
>> Because they give you better abstractions. >> That's right. >> To talk to the platforms. >> That's right. >> That's interesting. Okay, geesh, we learn something really, really interesting with each interview. I'm curious though, if you look out a couple years, how much of your application landscape will be continuous processing, and is that something you can see mainstream enterprises adopting, or has decades of work with, you know, batch and interactive sort of made people too difficult to learn something so radically new? >> I think it's all going to be driven by the business needs, and whether the value is there for people to make that transition 'cause it is quite expensive to invest in new infrastructure. For companies like Lyft, where we're trying to make decisions very quickly, you know, users get down to two seconds makes a difference for the customer, so we're trying to be as, you know, real-time as possible. I used to work at Salesforce. Salespeople are a little less sensitive to these things, and you know it's very, very traditional world. >> That's interesting. (background applauding) >> But even Salesforce is moving towards that style. >> Even Salesforce is moving? >> Is moving toward streaming processing. >> Really? >> George: So like, I think we're going to see it slowly be adopted across the big enterprises. >> George: I imagine that's probably for their analytics. >> That's where they're starting, of course, yeah. >> Okay. So, this was, a little more affirmation on to how we're going to see the control plane evolve, and the interesting use-cases that you're up to. I hope we can see you back next year. And you can tell us how far you've proceeded. >> I certainly hope so, yeah. >> This was really interesting. So, Greg Fee from Lyft. We will hopefully see you again. And this is George Gilbert. We're at the Data Artisans Flink Forward conference in San Francisco. We'll be back after this break. (techno music)

Published Date : Apr 12 2018


Jagane Sundar, WANdisco | AWS Summit SF 2018


 

>> Voiceover: Live from the Moscone Center, it's theCUBE. Covering AWS Summit San Francisco 2018. Brought to you by Amazon Web Services. >> Welcome back, I'm Stu Miniman and this is theCUBE's exclusive coverage of AWS Summit here in San Francisco. Happy to welcome back to the program Jagane Sundar, who is the CTO of WANdisco. Jagane, great to see you, how have you been? >> Well, been great Stu, thanks for having me. >> All right so, every show we go to now, data really is at the center of it, you know. I'm an infrastructure guy, you know, data is so much of the discussion here, here in the cloud in the keynotes, they were talking about it. IOT of course, data is so much involved in it. We've watched WANdisco from the days that we were talking about big data. Now it's you know, there's AI, there's ML. Data's involved, but tell us what is WANdisco's position in the marketplace today, and the updated role on data? >> So, we have this notion, this brand new industry segment called live data. Now this is more than just itty-bitty data or big data, in fact this is cloud-scale data located in multiple regions around the world and changing all the time. So you have East Coast data centers with data, West Coast data centers with data, European data centers with data, all of this is changing at the same time. Yet, your need for analytics and business intelligence based on that is across the board. You want your analytics to be consistent with the data from all these locations. That, in a sense, is the live data problem. >> Okay, I think I understand it but, you know, we're not talking about like, in the storage world there was like hot data, what's hot and cold data. And we talked about real-time data for streaming data and everything like that. But how do you compare and contrast, you know, you said global in scope, talked about multi-region, really talking distributed. From an architectural standpoint, what's enabling that to be kind of the discussion today? Is it the likes of Amazon and their global reach? And where does WANdisco fit into the picture? >> So Amazon's clearly a factor in this. The fact that you can start up a virtual machine in any part of the world in a matter of minutes and have data accessible to that VM in an instant changes the business of globally accessible data. You're not simply talking about a primary data center and a disaster recovery data center anymore. You have multiple data centers, the data's changing in all those places, and you want analytics on all of the data, not part of the data, not on the primary data center, how do you accomplish that, that's the challenge. >> Yeah, so drill into it a little bit for us. Is this a replication technology? Is this just a service that I can spin up? When you say live, can I turn it off? How do those kind of, when I think about all the cloud dynamics and levers? >> So it is indeed based on active-active replication, using a mathematically strong algorithm called Paxos. In a minute, I'll contrast that with other replication technologies, but the essence of this is that by using this replication technology as a service, so if you are going up to Amazon's web services and you're purchasing some analytics engine, be it Hive or Redshift or any analytics engine, and you want to have that be accessible from multiple data centers, be available in the face of data center or entire region failure, and the data should be accessible, then you go with our live data platform. >> Yeah so, we want you to compare and contrast. 
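The "mathematically strong algorithm called Paxos" that Jagane mentions is a consensus protocol: a majority of replicas must agree before a write is ordered, which is what lets every data center accept writes without the copies diverging. As a rough illustration of the idea only, and emphatically not WANdisco's implementation, here is a toy single-decree Paxos round in Python; all class and function names are invented for the sketch.

```python
class Acceptor:
    """One acceptor per data center; a majority must agree on each write."""
    def __init__(self):
        self.promised = -1          # highest ballot promised so far
        self.accepted = (-1, None)  # (ballot, value) last accepted

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """One single-decree Paxos round; returns the value the majority chose."""
    # Phase 1: gather promises from a majority.
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None  # lost to a higher ballot; caller retries with a new one
    # If some acceptor already accepted a value, we must propose that value.
    prior = max(granted, key=lambda acc: acc[0])
    chosen = prior[1] if prior[1] is not None else value
    # Phase 2: ask the majority to accept.
    votes = sum(a.accept(ballot, chosen) for a in acceptors)
    return chosen if votes > len(acceptors) // 2 else None

# Three "data centers" agreeing on which write is ordered first.
dcs = [Acceptor(), Acceptor(), Acceptor()]
print(propose(dcs, ballot=1, value="east-coast-write-42"))
```

The key property for active-active replication is that every replica ends up applying writes in the same agreed order, regardless of which region the write originated in.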
What I think about, you know, I hear active-active, speed of light's always a challenge. You know, globally you have inconsistency, it's challenging; there are things like Google Spanner out there to look at those. You know, how does this fit compared to the way we've thought of things like replication and globally distributed systems in the past? >> Interesting question. So, ours is great for analytics applications, but something like Google Spanner is more like a MySQL database replacement that runs in multiple data centers. We don't cater to that kind of database-transaction type of application. We cater to analytics applications: batch, very fast streaming applications, enterprise data warehouse-type analytics applications, for all of those. Now if you take a look inside and see what kind of replication technology will be used, you'll find that we're better than the other two different types. There are two different types of existing replication technologies. One is log shipping. The traditional Oracle GoldenGate type: ship the log once the change is made to the primary. The second is, take a snapshot and copy differences between snapshots. Both have their deficiencies. Snapshot of course is time-based, and it happens once in a while. You'll be lucky if you can get one day RTO with those sorts of things. Also, there's an interesting anecdote that comes to mind when I say that, because the Hadoop folks, in their HDFS, implemented a version of snapshot and snapdiff. The unfortunate truth is that it was engineered such that, if you have a lot of changes happening, the snapshot and snapdiff code might consume too much memory and bring down your NameNode. That's undesirable: now your backup facility just brought down your main data capability. So snapshot has its deficiencies. Log shipping is always active/passive. Contrast that with our technology of live data, where you can have multiple data centers filled with data. You can write your data to any of these data centers. It makes for a much more capable system. >> Okay, can you explain, how does this fit with AWS and can it live in multi-clouds, what about on-premises, the whole, you know, multi and hybrid cloud discussion? >> Interesting, so the answer is yes. It can live in multiple regions within the same cloud, multiple regions within different clouds. It'll also bridge data that exists on your on-prem Hadoop or other big data systems, or object store systems within the cloud, S3 or Azure, or any of the BLOB stores available in the cloud. And when I say this, I mean in a live data fashion. That means you can write to your on-prem storage, you can also write to your cloud buckets at the same time. We'll keep it consistent and replicated. >> Yeah, what are you hearing from customers when it comes to where their data lives? I know last time I interviewed David Richards, your CEO, he said the data lakes really used to be on premises, now there's a massive shift moving to the public clouds. Is that continuing, what's kind of the breakdown, what are you hearing from customers? >> So I cannot name a single customer of ours who is not thinking about the cloud. Every one of them has a presence on premise. They're looking to grow in the cloud. On-prem does not appear to be on a growth path for them. They're looking at growing in the cloud, they're looking at bursting into the cloud, and they're almost all looking at multi-cloud as well. That's been our experience. >> At the beginning of the conversation we talked about data.
How are customers doing, you know, exploiting and leveraging, or making sure that they aren't having data become a liability for them? >> So there are so many interesting use cases I'd love to talk about, but the one that jumps out at me is a major auto manufacturer. Telematics data coming in from a huge number, hundreds of thousands, of cars on the road. They chose to use our technology because they can feed their West Coast car telematics into their West Coast data center, while simultaneously writing East Coast car data into the East Coast data center. We do the replication, we build the live data platform for them, they run their standard analytics applications, be it Hadoop-sourced or some other analytics applications, and they get consistent answers. Whether you run the analytics application on the East Coast or the West Coast, you will get the same exact answer. That is very valuable because if you are doing things like fault detection, you really don't want spurious detection because the data on the West Coast was not quite consistent and your analytics application was led astray. That's a great example. We also have another example with a top three bank that has a regulatory concern where they need to operate out of their backup data centers, so-called backup data centers, once every three months or so. Now with live data, there is no notion of active data center and backup data center. All data centers are active, so this particular regulatory requirement is extremely simple for them to implement. They just run their queries on one of the other data centers and prove to the regulators that their data is indeed live. I could go on and on about a number of these. We also have a top two retailer who has got such a volume of data that they cannot manage it in one Hadoop cluster. They use our technology to create the live data data lake. >> One of the challenges always, customers love the idea of global, but governance, compliance, things like GDPR pop up. Does that play into your world? Or is that a bit outside of what WANdisco sees? >> It actually turns out to be an important consideration for us because, if you think about it, when we replicate, the data flows through us. So we can be very careful about not replicating data that is not supposed to be replicated. We can also be very careful about making sure that the data is available in multiple regions within the same country if that is the requirement. So GDPR does play a big role in the reason why many of our customers, particularly in the financial industry, end up purchasing our software. >> Okay, so this new term live data, are there any other partners of yours that are involved in this? As always, you want like a bit of an ecosystem to help build out a wave. >> So our most important partners are the cloud vendors. And they're multi-region by nature. There is no idea of a single data center or a single region cloud, so Microsoft, Amazon with AWS, these are all important partners of ours, and they're promoting our live data platform as part of their strategy of building huge hybrid data lakes. >> All right, Jagane, give us a little view looking forward. What should we expect to see with live data and WANdisco through the rest of 2018?
>> Looking forward, we expect to see our footprint grow in terms of dealing with a variety of applications, all the way from batch, Pig scripts that used to run once a day, to Hive that's maybe once every 15 minutes, to data warehouses that are almost instant and queryable by human beings, to streaming data that pours things into Kafka. We see the whole footprint of analytics databases growing. We see cross-capability, meaning perhaps an Amazon Redshift to an Azure SQL EDW replication. Those things are very interesting to us and to our customers, because some of them have strengths in certain areas and others have strengths in other areas. Customers want to exploit both of those. So we see us as being the glue for all world-scale analytics applications. >> All right well, Jagane, I appreciate you sharing with us everything that's happening at WANdisco. This new idea of live data, we look forward to catching up with you and the team in the future and hearing more about the customers and everything on there. We'll be back with lots more coverage here from AWS Summit here in San Francisco. I'm Stu Miniman, you're watching theCUBE. (electronic music)
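Jagane's contrast between replication styles earlier in the interview is easy to see in miniature. A snapshot-based replicator only captures state at snapshot times and then ships the difference to a passive copy, which is why its recovery point trails by design, and why heavy churn between snapshots means big diffs. Here is a toy sketch of that diff step, with plain dicts standing in for file-system block reports; the data and shapes are invented for illustration.

```python
def snapshot_diff(old, new):
    """Compute the changes needed to move a replica from `old` to `new`.

    Snapshots are plain dicts of {path: content}; a real system would
    compare block checksums, but the shape of the problem is the same.
    """
    puts = {p: c for p, c in new.items() if old.get(p) != c}
    deletes = [p for p in old if p not in new]
    return puts, deletes

snap_t0 = {"/data/a": "v1", "/data/b": "v1"}
snap_t1 = {"/data/a": "v2", "/data/c": "v1"}
puts, deletes = snapshot_diff(snap_t0, snap_t1)
print(puts)     # {'/data/a': 'v2', '/data/c': 'v1'}
print(deletes)  # ['/data/b']
```

An active-active, consensus-based design avoids this whole periodic catch-up step: every replica accepts writes continuously and the protocol keeps them in agreement.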

Published Date : Apr 4 2018


Jacques Nadeau, Dremio | Big Data SV 2018


 

>> Announcer: Live from San Jose, it's theCUBE, presenting Big Data Silicon Valley. Brought to you by SiliconANGLE Media and its ecosystem partners. >> Welcome back to Big Data SV in San Jose. This is theCUBE, the leader in live tech coverage. My name is Dave Vellante and this is day two of our wall-to-wall coverage. We've been here most of the week, had a great event last night, about 50 or 60 of our CUBE community members were here. We had a breakfast this morning where the Wikibon research team laid out its big data forecast, the eighth big data forecast and report that we've put out, so check that out online. Jacques Nadeau is here. He is the CTO and co-founder of Dremio. Jacques, welcome to theCUBE, thanks for coming on. >> Thanks for having me here. >> So we were talking a little bit about what you guys do. Three year old company. Well, let me start. Why did you co-found Dremio? >> So, it was a very simple thing I saw, so, over the last ten years or so, we saw a regression in the ability for people to get at data, so you see all these really cool technologies that came out to store data. Data lakes, you know, SQL systems, all these different things that make developers very agile with data. But what we were also seeing was a regression in the ability for analysts and data consumers to get at that data because the systems weren't designed for analysts, they were designed for data producers and developers. And we said, you know what, there needs to be a way to solve this. We need to be able to empower people to be self-sufficient again at the data consumption layer. >> Okay, so you solved that problem how? You said you call it a self-service data platform. >> Yeah, yeah, so a self-service data platform, and the idea is pretty simple. It's that, no matter where the data is physically, people should be able to interact with a logical view of it. And so, we talk a little bit like it's Google Docs for your data. So people can go into the system, they can see the different data sets that are available to them, collaborate around those, create changes to those that they can then share with other people in the organization, always dealing with the logical layer, and then, behind the scenes, we have physical capabilities to interact with all the different systems we interact with. But that's something that business users shouldn't have to think as much about, and so, if you think about how people interact with data today, it's very much about copies. So every time you want to do something, typically you're going to make a copy. I want to reshape the data, I make a copy. I want to make it go faster, I make a copy. And those copies are very, very difficult for people to manage, and they could have mixed the business meaning of data with the physical, I'm making copies to make them faster or whatever. And so our perspective is that, if you can separate away the physical concerns from the logical, then business users have a much, much better likelihood of being able to do something self-service. >> So you're essentially virtualizing my corpus of data, independent of location, is that right, I mean-- >> It's part of what we do, yeah. No, it's part of what we do. So, the way we look at it is, it's kind of several different components to try to make something self-service. It starts with, yeah, virtualize or abstract away the details of the physical, right?
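The "logical view" idea Jacques describes is the crux: a derived dataset records its transformation instead of copying rows, so reshaping data is free and every view stays in sync with the source. Here is a toy sketch of that separation in Python; the class and method names are invented for illustration and are not Dremio's API.

```python
class VirtualDataset:
    """A logical view: records the transformation, not a copy of the data.

    Rows are only materialized when someone actually reads the view, so
    deriving a new view never duplicates the underlying physical data.
    """
    def __init__(self, source, transforms=()):
        self._source = source          # callable returning an iterable of dicts
        self._transforms = transforms  # chain of row -> row-or-None functions

    def derive(self, fn):
        return VirtualDataset(self._source, self._transforms + (fn,))

    def rows(self):
        for row in self._source():
            for fn in self._transforms:
                row = fn(row)
                if row is None:
                    break
            else:
                yield row

raw = VirtualDataset(lambda: [{"user": "a", "spend": 120}, {"user": "b", "spend": 30}])
big_spenders = raw.derive(lambda r: r if r["spend"] > 100 else None)
usd = big_spenders.derive(lambda r: {**r, "spend": f"${r['spend']}"})
print(list(usd.rows()))  # [{'user': 'a', 'spend': '$120'}]
```

Because `usd` and `big_spenders` are just recorded transformations over the same source, there are no physical copies to drift out of sync, which is the problem with the copy-per-reshape workflow described above.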
But then, on top of that, expose a very, sort of a very user-friendly interface that allows people to sort of catalog and understand the different things, you know, search for things that they want to interact with, and then curate things, even if they're non-technical users, right? So the goal is that, if you talk to sort of even large internet companies in the Valley, it's very hard to even hire the amount of data engineering that you need to satisfy all the requests of your end-users of data. And so the, and so the goal of Dremio is basically to figure out different tools that can provide a non-technical experience for getting at the data. So that's sort of the start of it but then the second step is, once you've got access to this thing and people can collaborate and sort of deal with the data, then you've got these huge volumes of data, right? It's big data and so how do you make that go faster? And then we have some components that we deal with, sort of, speed and acceleration. >> So maybe talk about how people are leveraging this capability, this platform, what the business impact is, what have you seen there? >> So a lot of people have this problem, which is, they have data all over the place and they're trying to figure out "How do I expose this "to my end-users?" And those end-users might be analysts, they might be data scientists, they might be product managers that are trying to figure out how their product is working. And so, what they're doing today is they're typically trying to build systems internally that, to provide these capabilities. And so, for example, working with a large auto manufacturer. And they've got a big initiative where they're trying to make the data that they have, they have huge amounts of data across all sort of different parts of the organization and they're trying to make that available to different data consumers. Now, of course, there's a bunch of security concerns that you need to have around that, but they just want to make the data more accessible. And so, what they're doing is they're using Dremio to figure out ways to, basically, catalog all the data below, expose that to the different users, applying lots of different security rules around that, and then create a bunch of reflections, which make the things go faster as people are interacting with the things. >> Well, what about the governance factor? I mean, you heard this in the hadoop world years ago. "Ah, we're going to make, we're going to harden hadoop, "we're going to" and really, there was no governance and it became more and more important. How do you guys handle that? Do you partner with people? Is it up to the customer to figure that out? Do you provide that? >> It's several different things, right? It's a complex ecosystem, right? So it's a combination of things. You start with partnering with different systems to make sure that you integrate well with those things. So the different things that control some parts of credentials inside the systems all the way down to "What's the file system permissions?", right? "What are the permissions inside of something like Hive and the metastore there?" And then other systems on top of that, like Sentry or Ranger are also exposing different credentialing, right? And so we work hard to sort of integrate with those things. On top of that, Dremio also provides a full security model inside of the sort of virtual space that we work. 
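A security model over virtual objects boils down to checking a user's roles against grants on each logical dataset before any physical access happens. Here is a toy sketch of that check; the names and structure are invented for illustration and are not Dremio's actual model.

```python
from dataclasses import dataclass, field

@dataclass
class Catalog:
    """Toy permission check over virtual datasets: access is enforced at
    the logical layer, regardless of where the data physically lives."""
    grants: dict = field(default_factory=dict)  # dataset -> {role: {actions}}

    def grant(self, dataset, role, *actions):
        self.grants.setdefault(dataset, {}).setdefault(role, set()).update(actions)

    def check(self, dataset, user_roles, action):
        rules = self.grants.get(dataset, {})
        return any(action in rules.get(role, set()) for role in user_roles)

cat = Catalog()
cat.grant("sales.customers", "analyst", "read")
cat.grant("sales.customers", "data_engineer", "read", "edit")
print(cat.check("sales.customers", ["analyst"], "read"))  # True
print(cat.check("sales.customers", ["analyst"], "edit"))  # False
```

In a real deployment, the roles would come from a directory such as LDAP and the underlying engines would still apply their own controls, which is why the integration work described below matters.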
And so people can control the permissions, the ability to access or edit any object inside of Dremio based on user roles and LDAP and those kinds of things. So it's, it's kind of multiple layers that have to be working together. >> And tell me more about the company. So founded three years ago, I think a couple of raises, >> Yep. >> who's backing you? >> Yeah, yeah, yeah, so we founded just under three years ago. We had great initial investors in Redpoint and Lightspeed, so two great initial investors, and we raised about 15 million on that round. And then we actually just closed a B round in January of this year and we added Norwest to the portfolio there. >> Awesome, so you're now in the mode of, I mean, they always say, you know, software is such a capital-efficient business but you see software companies raising, you know, 900 million dollars and so, presumably, that's to compete, to go to market and, you know, differentiate with your messaging and branding. Is that sort of the phase that you're in now? You kind of developed a product, it's technically sound, it's proven in the marketplace, and now you're scaling the, the go-to-market, is that right? >> That's exactly right. So, so we've had a lot of early successes, a lot of Fortune 100 companies using Dremio today. For example, we're working with TransUnion. We're working with Intel. We actually have a great relationship with OVH, which is the third-largest hosting company in the world, so a lot of great names; Daimler is another one. So working with a lot of great companies, seeing sort of great early success with the product with those companies, and really looking to say "Hey, we're out here." We've got a booth for the first time at Strata here and we're sort of letting people know about, sort of, a better way, or easier way, for people to deal with data >> Yeah. >> A happier way. >> I mean, it's a crowded space, right? There's a lot of tools out there, a lot of companies. I'm interested in how you sort of differentiate. Obviously simplification is a part of that, the breadth of your capabilities. But maybe, in your words, you could share with me how you differentiate from the competition and how you break out from the noise. >> Yeah, yeah, yeah, so it's, you're absolutely right, it's a very crowded space. Everybody's using the same words and that makes it very hard for people to understand what's going on. And so, what we've found is very simple: typically we will actually, the first meeting we deal with a customer, within the first 10 minutes we'll demo the product. Because so many technologies are technologies, not products, and so you have to figure out how to use the product. You've got to figure out how you would customize it for your certain use-case. And what we've found with our product is, by making it very, very simple, people start, the light goes on in a very short amount of time and so, we also do things on our website so that you can see, in a couple of minutes, or even less than that, little animations that sort of give you a sense of what it's about. But really, it's just "Hey, this is a product "which is about", there's this light bulb that goes on, it's great. And you figure this out over the course of working with different customers, right?
But there's this light bulb that goes on for people that are so confused by all the things that are going on and if we can just sit down with them, show them the product for a few minutes, all of a sudden they're like "Wait a minute, "I can use this", right? So you're frequently talking to buyers that are not the most technical parts of the organization initially, and so most of the technologies they look at are technologies that are very difficult to understand and they have to look to others to try to even understand how it would fit into their architecture. With Dremio, we have customers that can, that have installed it and gotten up, and within an hour or two, started to see real value. And that sort of excitement happens even in the demo, with most people. >> So you kind of have this bifurcated market. Since the big data meme, everybody says they're data-driven and you've got a bifurcated market in that, you've got the companies that are data-driven and you've got companies who say they're data-driven but really aren't. Who are your customers? Are they in both? Are they predominantly in the data-driven side? Are they predominantly in the trying to be data-driven? >> Well, I would say that they all would say that they're data-driven. >> Yeah, everyone, who's going to say "Well, we're not data-driven." >> Yeah, yeah, yeah. So I would say >> We're dead. >> I would say that everybody has data and they've got some ways that they're using it well and other places where they feel like they're not using it as well as they should. And so, I mean, the reason that we exist is to make it so it's easier for people to get value out of data, and so, if they were getting all the value they think they could get out of data, then we probably wouldn't exist and they would be fully data-driven. So I think that everybody, it's a journey and people are responding well to us, in part, because we're helping them down that journey. >> Well, the reason I asked that question is that we go to a lot of shows and everybody likes to throw out the digital transformation buzzword and then use Uber and Airbnb as an example, but if you dig deeper, you see that data is at the core of those companies and they're now beginning to apply machine intelligence and they're leveraging all this data that they've built up, this data architecture that they built up over the last five or 10 years. And then you've got this set of companies where all the data lives in silos and I can see you guys being able to help them. At the same time, I can see you helping the disruptors, so how do you see that? I mean, in terms of your role, in terms of affecting either digital transformations or digital disruptions. >> Well, I'd say that in either case, so we believe in a very sort of simple thing, which is that, so going back to what I said at the beginning, which is just that I see this regression in terms of data access, right? And so what happens is that, if you have a tightly-coupled system between two layers, then it becomes very difficult for people to sort of accommodate two different sets of needs. And so, the change over the last 10 years was the rise of the developer as the primary person for controlling data and that brought a huge amount of great things to it but analysis was not one of them. And there's tools that try to make that better but that's really the problem. And so our belief is very simple, which is that a new tier needs to be introduced between the consumers and the, and the producers of data. 
And that, and so that tier may interact with different systems, it may be more complex or whatever, for certain organizations, but the tier is necessary in all organizations because the analysts shouldn't be shaken around every time the developers change how they're doing data. >> Great. John Furrier has a saying that "Data is the new development kit", you know. He said that, I don't know, eight years ago and it's really kind of turned out to be the case. Jacques Nadeau, thanks very much for coming on theCUBE. Really appreciate your time. >> Yeah. >> Great to meet you. Good luck and keep us informed, please. >> Yes, thanks so much for your time, I've enjoyed it. >> You're welcome. Alright, thanks for watching everybody. This is theCUBE. We're live from Big Data SV. We'll be right back. (bright music)

Published Date : Mar 9 2018


Kunal Agarwal, Unravel Data | Big Data SV 2018


 

>> Announcer: Live from San Jose, it's theCUBE! Presenting Big Data: Silicon Valley. Brought to you by SiliconANGLE Media and its ecosystem partners. (techno music) >> Welcome back to theCUBE. We are live on our first day of coverage at our event BigDataSV. I am Lisa Martin with my co-host George Gilbert. We are at this really cool venue in downtown San Jose. We invite you to come by today, tonight for our cocktail party. It's called Forager Tasting Room and Eatery. Tasty stuff, really, really good. We are down the street from the Strata Data Conference, and we're excited to welcome to theCUBE a first-time guest, Kunal Agarwal, the CEO of Unravel Data. Kunal, welcome to theCUBE. >> Thank you so much for having me. >> So, I'm a marketing girl. I love the name Unravel Data. (Kunal laughs) >> Thank you. >> Two year old company. Tell us a bit about what you guys do and why that name... What's the implication there with respect to big data? >> Yeah, we are an application performance management company. And big data applications are just very complex. And the name Unravel is all about unraveling the mysteries of big data and understanding why things are not performing well and not really needing a PhD to do so. We're simplifying application performance management for the big data stack. >> Lisa: Excellent. >> So, so, um, you know, one of the things that a lot of people are talking about with Hadoop, originally it was this cauldron of innovation. Because we had the "let a thousand flowers bloom" in terms of all the Apache projects. But then once we tried to get it into operation, we discovered there's a... >> Kunal: There's a lot of problems. (Kunal laughs) >> There's an overhead, there's a downside to it. >> Maybe tell us why you need to know how people have done this many, many times. >> Yeah. >> How you need to learn from experience and then how you can apply that even in an environment where someone hasn't been doing it for that long. >> Right. So, if I step back a little bit. Big data is powerful, right? It's giving companies an advantage that they never had, and data's an asset to all of these different companies. Now they're running everything from BI, machine learning, artificial intelligence, IoT, streaming applications on top of it for various reasons. Maybe it is to create a new product to understand the customers better, etc. But as you rightly pointed out, when you start to implement all of these different applications and jobs, it's very, very hard. It's because big data is very complex. With that great power comes a lot of complexity, and what we started to see is a lot of companies, while they want to create these applications and provide that differentiation to their company, they just don't have enough expertise in house to go and write good applications, maintain these applications, and even manage the underlying infrastructure and cluster that all these applications are running on. So we took it upon ourselves where we thought, Hey, if we simplify application performance management and if we simplify ongoing management challenges, then these companies would run more big data applications, they would be able to expand their use cases, and not really be fearful of, Hey, we don't know how to go and solve these problems. Do we actually rely on our system that is so complex and new?
And that's the gap that Unravel fills, which is we monitor and manage not only one component of the big data ecosystem, but like you pointed out, it's a, it's a full zoo of all of these systems. You have Hadoop, and you have Spark, and you have Kafka for data ingestion. You may have some NoSQL systems and newer MPP platforms as well. So the vision of Unravel is really to be that one place where you can come in and understand what's happening with your applications and your system overall and be able to resolve those problems in an automatic, simple way. >> So, all right, let's start at the concrete level of what a developer might get out of >> Kunal: Right. >> something that's wrapped in Unravel and then tell us what the administrator experiences. >> Kunal: Absolutely. So if you are a big data developer you've got a business requirement that says, Hey, go and make this application that understands our customers better, right? They may choose a tool of their liking, maybe Hive, maybe Spark, maybe Kafka for data ingestion. And what they'll do is they'll write an app first in dev, in their dev environment or the QA environment. And they'll say, Hey, maybe this application is failing, or maybe this application is not performing as fast as I want it to, or even worse that this application is starting to hog a lot of resources, which may slow down my other applications. Now to understand what's causing these kinds of problems today developers really need a PhD to go and decipher them. They have to look at tons of raw logs, metrics, configuration settings and then try to stitch the story up in their head, trying to figure out what is the effect, what is the cause? Maybe it's this problem, maybe it's some other problem. And then do trial and error to try, you know, to solve that particular issue. Now what we've seen is big data developers come in a variety of flavors. You have the hardcore developers who truly understand Spark and Hadoop and everything, but then 80% of the people submitting these applications are data scientists or business analysts, who may understand SQL, who may know Python, but don't necessarily know what distributed computing and parallel processing and all of these things really are, and where inefficiencies and problems can really lie. So we give them this one view, which will connect all of these different data sources and then tell them in plain English, this is the problem, this is why this problem happened, and this is how you can go and resolve it, thereby getting them unstuck and making it very simple for them to go in and get the performance that they're expecting. >> So, these, these, um, they're the developers up front and you're giving them a whole new, sort of, toolchain or environment to solve the operational issues. >> Kunal: Right. >> So that, if it's DevOps, it's really dev that's much more self-sufficient. >> Yes, yes, I mean, all companies want to run fast. They don't want to be slowed down. If you have a problem today, they'll file a ticket, it'll go to the operations team, you wait a couple of days to get some more information back. That just means your business has slowed down. If things are simple enough where the application developers themselves can resolve a lot of these issues, that'll get the business unstuck and get them moving on further. Now, to the other point which you were asking, which is what about the operations and the app support people?
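To make the "plain English answers" idea concrete, here is a toy sketch of the kind of rule-based mapping from raw job telemetry to findings that Kunal describes. The metric names, thresholds, and rules are invented for illustration; they are not Unravel's actual checks.

```python
import statistics

def diagnose(job):
    """Map raw job telemetry to plain-English findings.

    `job` is a hypothetical dict of metrics; the rules below are
    illustrative stand-ins for the checks an expert would run by hand.
    """
    findings = []
    tasks = job.get("task_durations_sec", [])
    if tasks and max(tasks) > 5 * statistics.median(tasks):
        findings.append(
            "Data skew: one task runs far longer than the median -- "
            "consider repartitioning on a higher-cardinality key.")
    if job.get("spill_bytes", 0) > 0:
        findings.append(
            "Memory spill detected -- increase executor memory or "
            "reduce partition size.")
    if job.get("containers_requested", 0) > job.get("containers_granted", 0):
        findings.append(
            "The job waited on containers, not on its own code -- "
            "the queue is under-provisioned.")
    return findings or ["No known inefficiency patterns matched."]

print(diagnose({"task_durations_sec": [10, 12, 11, 95],
                "spill_bytes": 2_000_000,
                "containers_requested": 50,
                "containers_granted": 50}))
```

The point of stitching logs, metrics, and configuration into one view first is that rules like these can only fire when the full picture is in one place.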
So, Unravel's a great tool for them too because that helps them see what's happening holistically in the cluster. How are other applications behaving with each other? It's usually a multitenant, multiapplication environment that these big data jobs are running on. So, are my apps slowing down George's apps? Am I stealing resources from your applications? More so, not just about an individual application issue itself. So Unravel will give you visibility into each app, as well as the overall cluster, to help you understand cluster-wide problems. >> Love to get at, maybe peel apart your target audience a little bit. You talked about DevOps. But also the business analysts, data scientists, and we talk about big data. Data has such tremendous power to fuel a company and, you know, like you said, use it to create and deliver new products. Are you talking with multiple audiences within a company? Do you start at DevOps and they bring in their peers? Or do you actually start, maybe, at the Chief Data Officer level? What's that kind of entrance for Unravel? >> So the word I use to describe this is DataOps, instead of DevOps, right? So in the older world you had developers, and you had operations people. Over here you have a data team and operations people, and that data team can be comprised of the developers, the data scientists, the business analysts, etc., as well. But you're right. Although we first target the operations role because they have to manage and monitor the system and make sure everything is running like a well-oiled machine, they are now spreading it out to the end-users, meaning the developers themselves saying, "Don't come to me for every problem. "Look at Unravel, try to solve it here, "and if you cannot, then come to me." This is all, again, improving agility within the company, making sure that people have the necessary tools and insights to carry on with their day. >> Sounds like an enabler, >> Yeah, absolutely. >> That operations would push down to the dev side, the developers themselves. >> And even the managers and the CDOs, for example, they want to see the ROI that they're getting from their big data investments. They want to see, they have put in these millions of dollars, have got an infrastructure and these services set up, but how are we actually moving the needle forward? Are there any applications that we're actually putting in business, and is that driving any business value? So we will be able to give them a very nice dashboard helping them understand what kind of throughput are you getting from your system, how many applications were you able to develop last week and onboard to your production environment? And what's the rate of innovation that's really happening inside your company on those big data ecosystems? >> It sort of brings up an interesting question on two prongs. One is the well-known, but inexact number about how many big data projects, >> Kunal: Yeah, yeah. >> I don't know whether they failed or didn't pay off. So there's going in and saying, "Hey, we can help you manage this "because it was too complicated." But then there's also the, all the folks who decided, "Well, we really don't want "to run it all on-prem. "We're not going to throw away everything we did there, "but we're going to also put a lot of new investment >> Kunal: Exactly, exactly. >> in the cloud. Now, Wikibon has a term for that, which is true private cloud, which is when you have the operational processes that you use in the public cloud and you can apply them on-prem.
>> George: But there's not many products that help you do that. How can Unravel work...? >> Kunal: That's a very good questions, George. We're seeing the world move more and more to a cloud environment, or I should say an on-demand environment where you're not so bothered about the infrastructure and the services, but you want Spark as a dial tone. You want Kafka as a dial tone. You want a machine-learning platform as a dial tone. You want to come in there, you want to put in your data, and you want to just start running it. Unravel has been designed from the ground up to monitor and manage any of these environments. So, Unravel can solve problems for your applications running on-premise and similarly all the applications that are running on cloud. Now, on the cloud there are other levels of problems as well so, of course, you'd have applications that are slow, applications that are failing; we can solve those problems. But if you look at a cloud environment, a lot of these now provide you an autoscaling capability, meaning, Hey, if this app doesn't run in the amount of time that we were hoping it to run, let's add extra hardware and run this application. Well, if you just keep throwing machines at the problem, it's not going to solve your issue. Now, it doesn't decrease the time that it will take linearly with how many servers that you're actually throwing in there, so what we can help companies understand is what is the resource requirement of a particular application? How should we be intelligently allocating resources to make sure that you're able to meet your time SLAs, your constraints of, here I need to finish this with x number of minutes, but at the same time be intelligent about how much cost you're spending over there. Do you actually need 500 containers to go and run this app? Well, you may have needed 200. How do you know that? So, Unravel will also help you get efficient with your run, not just faster, but also can it be a good multitenant citizen, can it use limited resources to actually run this applications as well? >> So, Kunal, some of the things I'm hearing from a customer's standpoint that are potential positive business outcomes are internal: performance boost. >> Kunal: Yeah. >> It also sounds like, sort of... productivity improvements internally. >> And then also the opportunity to have the insight to deliver new products, but even I'm thinking of, you know, helping make a retailer, for example, be able to do more targeted marketing, so >> the business outcomes and the impact that Unravel can make really seem to have pretty strong internal and external benefits. >> Kunal: Yes. >> Is there a favorite customer story, (Kunal laughs) don't have to mention names, that you really think speaks to your capabilities? >> So, 100% Improving performance is a very big factor of what Unravel can do. Decreasing costs by improving productivity, by limiting the amount of resources that you're using, is a very, very big factor. Now, amongst all of these companies that we work with, one key factor is improving reliability, which means, Hey, it's fine that he can speed up this application, but sometimes I know the latency that I expect from an app, maybe it's a second, maybe it's a minute, depending on the type of application. But what businesses cannot tolerate is this app taking five x amount more time today. If it's going to finish in a minute, tell me it'll finish in a minute and make sure it finishes in a minute. 
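Kunal's "How do you know that?" about the 500-versus-200 containers has a model-based answer: fit a simple curve to past runs and solve for the smallest container count that still meets the SLA. Here is a toy sketch using an Amdahl-style runtime model; the function, the model, and the numbers are invented for illustration, not Unravel's algorithm.

```python
def containers_needed(serial_sec, parallel_sec, sla_sec, max_containers=500):
    """Smallest container count whose modeled runtime meets the SLA.

    Uses a simple Amdahl-style model fit from job history:
    runtime(n) = serial_sec + parallel_sec / n.
    """
    for n in range(1, max_containers + 1):
        if serial_sec + parallel_sec / n <= sla_sec:
            return n
    return None  # SLA unreachable: hardware cannot fix the serial part

# A job with 60s of unparallelizable work and 24,000 container-seconds
# of parallel work, against a 3-minute SLA.
print(containers_needed(serial_sec=60, parallel_sec=24_000, sla_sec=180))
# -> 200: the SLA is met without paying for 500 containers.
```

The model also explains why "just keep throwing machines at the problem" fails: past a point, added containers only shave the shrinking parallel term while the serial term, and the bill, stay put.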
And this is a big use case for all of the big data vendors because a lot of the customers are moving from Teradata, or from Vertica, or from other relational databases, on to Hortonworks or Cloudera or Amazon EMR. Why? Because it's one tenth the amount of cost for running these workloads. But all the customers get frustrated and say, "I don't mind paying 10 x more money, "but because over there it used to work. "Over here, there are just so many complications, "and I don't have reliability with these applications." So that's a big, big factor of, you know, how we actually help these customers get value out of the Unravel product. >> Okay, so, um... A question I'm, sort of... why aren't there so many other Unravels? >> Kunal: Yeah. (Kunal laughs) >> From what I understood from past conversations. >> Kunal: Yeah. >> You can only really build the models that are at the heart of your capabilities based on tons and tons of telemetry >> Kunal: Yeah. >> that cloud providers or, or, sort of, internet scale service providers have accumulated in that, because they all have sort of a well-known set of configurations and well-known kind of topology. In other words, there're not a million degrees of freedom on any particular side that you can, you have a well-scoped problem, and you have tons of data. So it's easier to build the models. So who, who else could do this? >> Yeah, so the difference between Unravel and other monitoring products is Unravel is not a monitoring product. It's an intelligent performance management suite. What that means is we don't just give you graphs and metrics and say, "Here is all the raw information, "you go figure it out." Instead, we have to take it a step further where we are actually giving people answers. In order to develop something like that, you need full stack information; that's number one. Meaning information from applications all the way down to infrastructure and everything in between. Why? Because problems can lie anywhere. And if you don't have that full stack info, you're blindsiding yourself, or limiting the scope of the problems that you can actually search for. Secondly, like you were rightly pointing out, how do I create answers from all this raw data? So you have to think like how an expert with big data would think, which is, if there is a problem, what are the kinds of checks, balances, places that that person would look into, and how would that person establish that this is indeed the root cause of the problem today? And then, how would that person actually resolve this particular problem? So, we have a big team of scientists, researchers. In fact, my co-founder is a professor of computer science at Duke University who has been researching database optimization techniques for the last decade. We have about 80 plus publications in this area, Starfish being one of them. We have a bunch of other publications, which talk about how do you automate problem discovery, root cause analysis, as well as resolution, to get the best performance out of these different databases? And you're right. A lot of work has gone on the research side, but a lot of work has gone into understanding the needs of the customers. So we worked with some of the biggest companies out there, which have some of the biggest big data clusters, to learn from them: what are some everyday, ongoing management challenges that you face? And then taking that problem to our datasets and figuring out, how can we automate problem discovery? How can we proactively spot a lot of these errors?
I joke around and I tell people that we're big data for big data. Right? All these companies that we serve, they are gathering all of this data, and they're trying to find patterns, and they're trying to find, you know, some sort of an insight with their data. Our data is system-generated data, performance data, application data, and we're doing the exact same thing, which is figuring out inefficiencies, problems, cause and effect of things, to be able to solve it in a more intelligent, smart way. >> Well, Kunal, thank you so much for stopping by theCUBE >> Kunal: Of course. >> And sharing how Unravel Data is helping to unravel the complexities of big data. (Kunal laughs) >> Thank you so much. Really appreciate it. >> Now you're a CUBE alumni. (Kunal laughs) >> Absolutely. Thanks so much for having me. >> Kunal, thanks. >> Yeah, and we want to thank you for watching theCUBE. I'm Lisa Martin with George Gilbert. We are live at our own event, BigData SV, in downtown San Jose, California. Stick around. George and I will be right back with our next guest. (quiet crowd noise) (techno music)
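As a coda to the multitenancy point earlier in the interview (are my apps slowing down George's apps?), cross-tenant contention can also be framed as a rule over cluster-wide telemetry. A toy Python check follows; the metrics, the 1.5x fair-share threshold, and the app names are all invented for illustration.

```python
def contention_report(apps, cluster_vcores):
    """Flag tenants using well over their fair share while others wait.

    `apps` is a hypothetical list of {name, vcores_used, tasks_pending}.
    """
    fair_share = cluster_vcores / max(len(apps), 1)
    starved = [a["name"] for a in apps if a["tasks_pending"] > 0]
    hogs = [a["name"] for a in apps if a["vcores_used"] > 1.5 * fair_share]
    if starved and hogs:
        return [f"{h} is over 1.5x fair share while {', '.join(starved)} "
                f"have pending work" for h in hogs]
    return ["No cross-tenant contention detected."]

apps = [{"name": "etl_nightly", "vcores_used": 700, "tasks_pending": 0},
        {"name": "bi_dashboards", "vcores_used": 80, "tasks_pending": 42},
        {"name": "ml_training", "vcores_used": 220, "tasks_pending": 5}]
print(contention_report(apps, cluster_vcores=1000))
```

This is the cluster-level counterpart of the per-job diagnosis sketched earlier: the same telemetry, aggregated across tenants instead of within one application.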

Published Date : Mar 8 2018


Matthew Baird, AtScale | Big Data SV 2018


 

>> Announcer: Live from San Jose. It's theCUBE, presenting Big Data Silicon Valley. Brought to you by SiliconANGLE Media, and its ecosystem partners. (techno music) >> Welcome back to theCUBE, our continuing coverage on day one of our event, Big Data SV. I'm Lisa Martin with George Gilbert. We are down the street from the Strata Data Conference. We've got a great, a lot of cool stuff going on. You can see the cool set behind me. We are at Forager Tasting Room & Eatery. Come down and join us, be in our audience today. We have a cocktail event tonight, who doesn't want to join that? And we have a nice presentation tomorrow morning of Wikibon's 2018 Big Data Forecast and Review. Joining us next is Matthew Baird, the co-founder of AtScale. Matthew, welcome to theCUBE. >> Thanks for having me. Fantastic venue, by the way. >> Isn't it cool? >> This is very cool. >> Yeah, it is. So, talking about Big Data, you know, Gartner says, "85% of Big Data projects have failed." I often say failure is not a bad F word, because it can spawn the genesis of a lot of great business opportunities. Data lakes were big a few years ago, turned into swamps. AtScale has this vision of Data Lake 2.0, what is that? >> So, you're right. There have been a lot of failures, there's no doubt about it. And you're also right, that is how we evolve, and we're a Silicon Valley based company. We don't give up when faced with these things. It's just another way to not do something. So, what we've seen and what we've learned through our customers is they need to have a solution that is integrated with all the technologies that they've adopted in the enterprise. And it's really about, if you're going to make a data lake, you're going to have data on there that is the crown jewels of your business. How are you going to get that in the hands of your constituents, so that they can analyze it, and they can use it to make decisions? And how can we, furthermore, do that in a way that supplies governance and auditability on top of it, so that we aren't just sending data out into the ether and not knowing where it goes? We have a lot of customers in the insurance, health insurance space, and with financial customers that the data absolutely must be managed. I think one of the biggest changes is around that integration with the current technologies. There's a lot of movement into the Cloud. The new data lake is kind of focused more on these large data stores, where it was HDFS with Hadoop. Now it's S3, Google's object storage, and Azure ADLS. Those are the sorts of things that are backing the new data lake I believe. >> So if we take these, where the Data Lake Store didn't have to be something that's an open source HDFS implementation, it could even be just through an HDFS API. >> Matthew: Yeah, absolutely. >> What are some of the, how should we think about the data sources and feeds, for this repository, and then what is it on top that we need to put to make the data more consumable? >> Yeah, that's a good point. S3, Google Object Storage, and Azure, they all have a characteristic of, they are large stores. You can store as much as you want. They're generally on the Clouds, and in the open source on-prem software for landing the data exists, for streaming the data and landing it, but the important thing there is it's cost-effective. S3 is a cost-effective storage system. HDFS is a mostly cost-effective storage system.
You have to manage it, so it has a slightly higher cost, but the advice has been, get it to the place you're going to store it. Store it in a unified format. You get a halo effect when you have a unified format, and I think the industry is coalescing around... I'd probably say Parquet's in the lead right now, but once Parquet can be read by, let's take Amazon for instance, can be read by Athena, can be read by Redshift Spectrum, it can be read by their EMR, now you have this halo effect where your data's always there, always available to be consumed by a tool or a technology that can then deliver it to your end users. >> So when we talk about Parquet, we're talking about a columnar serialization format, >> Matthew: Yes. but there's more on top of that that needs to be layered, so that you can, as we were talking about earlier, combine the experience of a data warehouse, and the curated >> Absolutely data access where there's guard rails, >> Matthew: Yes >> and it's simple, versus sort of the wild west, but where I capture everything in a data lake. How do you bring those two together? >> Well, specifically for AtScale, we allow you to integrate multiple data access tools in AtScale, and then we use the appropriate tool to access the data for the use case. So let me give you an example, in the Amazon case, Redshift is wonderful for accessing interactive data, which BI users want, right? They want fast queries, sub-second queries. They don't want to pay to have all the raw data necessarily stored in Redshift 'cause that's pretty expensive. So they have this Redshift Spectrum, it's sitting in S3, that's cost effective. So when we go and we read raw data to build these summary tables, to deliver the data fast, we can read from Spectrum, we can put it all together, drop it into Redshift, a much smaller volume of data, so it has faster characteristics for being accessed. And it delivers it to the user that way. We do that in Hadoop when we access via Hive for building aggregate tables, but Spark or Impala are much faster interactive engines, so we use those. As I step back and look at this, I think the Data Lake 2.0, from a technical perspective is about abstraction, and abstraction's sort of what separates us from the animals, right? It's a concept where we can pack a lot of sophistication and complexity behind an interface that allows people to just do what they want to do. You don't know how, or maybe you do know how a car engine works, I don't really, kind of, a little bit, but I do know how to press the gas pedal and steer. >> Right. >> I don't need to know these things, and I think the Data Lake 2.0 is about, well I don't need to know how Sentry, or Ranger, or Atlas, or any of these technologies work. I need to know that they're there, and when I access data, they're going to be applied to that data, and they're going to deliver me the stuff that I have access to and that I can see. >> So a couple things, it sounded like I was hearing abstraction, and you said really that's kind of the key, that sounds like a differentiator for AtScale, is giving customers that abstraction they need. But I'm also curious from a data value perspective, you talked about in Redshift from an expense perspective. Do you also help customers gain abstraction by helping them evaluate value of data and where they ought to keep it, and then you give them access to it? Or is that something that they need to do, kind of bring to the table?
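
A minimal sketch of the unified-format "halo effect" described above: land the data once as Parquet on S3, and any engine that reads Parquet can serve the same files. The bucket, database, and table names here are hypothetical, and the Athena call assumes the table was already registered in the Glue catalog.

```python
# Land data once in a columnar, engine-neutral format, then query it in place.
# Names below (bucket, database, table) are invented for illustration.
import boto3
import pandas as pd

# Write raw data as Parquet on S3 (needs the pyarrow and s3fs packages).
df = pd.read_csv("daily_sales.csv")
df.to_parquet("s3://example-data-lake/sales/daily_sales.parquet")

# Athena can now query the same bytes without another copy, assuming the
# table "sales" is registered in the example_db Glue catalog database.
athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString="SELECT store_id, SUM(amount) FROM sales GROUP BY store_id",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
```

Redshift Spectrum or EMR could read those same files without moving them, which is the halo effect in practice.
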
>> We don't really care, necessarily, about the source of the data, as long as it can be expressed in a way that can be accessed by whatever engine it is. Lift and shift is an example. There's a big move to move from Teradata or from Netezza into a Cloud-based offering. People want to lift it and shift it. It's the easiest way to do this. Same table definitions, but that's not optimized necessarily for the underlying data store. Take BigQuery for example, BigQuery's an amazing piece of technology. I think there's nothing like it out there in the market today, but if you really want BigQuery to be cost-effective, and perform and scale up to concurrency of... one of our customers is going to roll out about 8,000 users on this. You have to do things in BigQuery that are BigQuery-friendly. The data structures, the way that you store the data, repeated values, those sorts of things need to be taken into consideration when you build your schema out for consumption. With AtScale they don't need to think about that, they don't need to worry about it, we do it for them. They drop the schema in the same way that it exists on their current technology, and then behind the scenes, what we're doing is we're looking at signals, we're looking at queries, we're looking at all the different ways that people access the data naturally, and then we restructure those summary tables using algorithms and statistics, and I think people would broadly call it ML type approaches, to build out something that answers those questions, and adapts over time to new questions, and new use cases. So it's really about, imagine you had the best data engineering team in the world, in a box, they're never tired, they never stop, and they're always interacting with what the customers really want, which is "Now I want to look at the data this way". >> It sounds actually like what you're talking about is you have a whole set of sources, and targets, and you understand how they operate, but when I say you, I mean your software. And so that you can take data from wherever it's coming in, and then you apply, if it's machine learning or whatever other capabilities to learn from the access methods, how to optimize that data for that engine. >> Matthew: Exactly. >> And then the end users have an optimal experience and it's almost like the data migration service that Amazon has, it's like, you give us your Postgres or Oracle database, and we'll migrate it to the cloud. It sounds like you add a lot of intelligence to that process for decision support workloads. >> Yes. >> And figure out, so now you're going to... It's not Postgres to Postgres, but it might be Teradata to Redshift, or S3, that's going to be accessed by Athena or Redshift, and then let's put that in the right format. >> I think you sort of hit something that we've noticed is very powerful, which is if you can set up, and we've done this with a number of customers, if you can set up at the abstraction layer that is AtScale, on your on-prem data, literally in, say, hours, you can move it into the Cloud, obviously you have to write the detail to move it into the Cloud, but once it's in the Cloud you take the same AtScale instance, you re-point it at that new data source, and it works. We've done that with multiple customers, and it's fast and effective, and it lets you actually try out things that you may not have the agility to do before because there's differences in how the SQL dialects work, there's differences in, potentially, how the schema might be built.
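
To make the "BigQuery-friendly" point concrete, here is a hedged sketch of a schema that uses nested, repeated fields instead of a lift-and-shift flat table; the project, dataset, and column names are invented for illustration.

```python
# A "BigQuery-friendly" schema: one order row carries its line items as a
# REPEATED record, so common queries avoid a join against a second table.
# Project/dataset/table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("store_id", "STRING"),
    # Repeated nested record: all line items live inside the order row.
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("amount", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("example-project.sales.orders", schema=schema)
client.create_table(table)
```

Queries over line_items then skip the join entirely, which is the kind of engine-specific restructuring the conversation says gets automated behind the scenes.
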
>> So a couple things I'm interested in, I'm hearing two A-words, that abstraction that we've talked about a number of times, you also mention adaptability. So when you're talking with customers, what are some of the key business outcomes they need to drive, where adaptability and abstraction are concerned, in terms of like cost reduction, revenue generation. What are some of those C-suite business objectives that AtScale can help companies achieve? >> So looking at, say, a customer, a large retailer on the East Coast, everybody knows the stores, they're everywhere, they sell hardware. They have a 20-terabyte cube that they use for day-to-day revenue analytics. So they do period over period analysis. When they're looking at stores, they're looking at things like, we just tried out a new marketing approach... I was talking to somebody there last week about how they have these special stores where they completely redo one area and just see how that works. They have to be able to look at those analytics, and they run those for a short amount of time. So if your window for getting data, refreshing data, building cubes, which in the old world could take a week, you know my co-founder at Yahoo, he had a week and a half build time. That data is now two weeks old, maybe three weeks old. There might be bugs in it-- >> And the relevance might be, pshh... >> And the relevance goes down, or you can't react as fast. I've been at companies where... Speed is so important these days, and the new companies that are grasping data aggressively, putting it somewhere where they can make decisions on it on a day-to-day basis, they're winning. And they're spending... I was at a company that was spending three million dollars on pay-per-click data, a month. If you can't get data everyday, you're on the wrong campaigns, and everything goes off the rails, and you only learn about it a week later, that's 25% of your spend, right there, gone. >> So the biggest thing, sorry George, it really sounds to me like what AtScale can facilitate for probably customers in any industry is the ability to truly make data-driven business decisions that can really directly affect revenue and profit. >> Yes, and in an agile format. So, you can build-- >> That's the third A; agile, adaptability, abstraction. >> There ya go, the three A's. (Lisa laughs) We had the three V's, now we have the three A's. >> Yes. >> The fact that you're building a curated model, so in retail the calendars are complex. I'm sure everybody that uses Tableau is good at analyzing data, but they might not know what your rules are around your financial calendar, or around the hierarchies of your product. There's a lot of things that happen where you want an enterprise group of data modelers to build it, bless it, and roll it out, but then you're a user, and you say, wait, you forgot x, y, and z, I don't want to wait a week, I don't want to wait two weeks, three weeks, a month, maybe more. I want that data to be available in the model an hour later 'cause that's what I get with Tableau today. And that's where we've taken the two approaches of enterprise analytics and self-service, and tried to create a scenario where you get the best of both worlds. >> So, we know that an implication of what you're telling us is that insights are perishable, and latency is becoming more and more critical. How do you plan to work with streaming data where you've got a historical archive, but you've got fresh data coming in? But fresh could mean a variety of things.
Tell us what some of those scenarios look like. >> Absolutely, I think there's two approaches to this problem, and I'm seeing both used in practice, and I'm not exactly sure, although I have some theories on which one's going to win. In one case, you are streaming everything into, sort of a... like I talked about, this data lake, S3, and you're putting it in a format like Parquet, and then people are accessing it. The other way is access the data where it is. Maybe it's already in, this is a common BI scenario, you have a big data store, and then you have a dimensional data store, like Oracle has your customers, Hadoop has machine data about those customers accessing on their mobile devices or something. If there was some way to access those data without having to move the Oracle stuff into the big data store, that's a Federation story that I think we've talked about in the Bay Area for a long time, or around the world for a long time. I think we're getting closer to understanding how we can do that in practice, and have it be tenable. You don't move the big data around, you move the small data around. For data coming in from outside sources it's probably a little bit more difficult, but it is kind of a degenerate version of the same story. I would say that streaming is gaining a lot of momentum, and with what we do, we're always mapping, because of the governance piece that we've built into the product, we're always mapping where did the data come from, where did it land, and how did we use it to build summary tables. So if we build five summary tables, 'cause we're answering different types of questions, we still need to know that it goes back to this piece of data, which has these security constraints, and these audit requirements, and we always track it back to that, and we always apply those to our derived data. So when you're accessing these automatically ETLed summary tables, it just works the way it is. So I think that there are two ways that this is going to expand and I'm excited about Federation because I think the time has come. I'm also excited about streaming. I think they can serve two different use cases, and I don't actually know what the answer will be, because I've seen both in customers, it's some of the biggest customers we have. >> Well Matthew thank you so much for stopping by, and four A's, AtScale can facilitate abstraction, adaptability, and agility. >> Yes. Hashtag four A's. >> There we go. I don't even want credit for that. (laughs) >> Oh wow, I'm going to get five more followers, I know it! (George laughs) >> There ya go! >> We want to thank you for watching theCUBE, I am Lisa Martin, we are live in San Jose, at our event Big Data SV, I'm with George Gilbert. Stick around, we'll be back with our next guest after a short break. (techno music)

Published Date : Mar 7 2018


Gabe Monroy, Microsoft Azure | KubeCon 2017


 

>> Announcer: Live from Austin, Texas, it's the Cube. Covering KubeCon and CloudNativeCon 2017. Brought to you by Red Hat, the Linux foundation, and the Cube's ecosystem partners. >> Hey welcome back everyone. Live here in Austin, Texas the Cube's exclusive coverage of KubeCon and CloudNativeCon, its third year, not even third year I think it's second year and not even three years old as a community, growing like crazy. Over 4500 people here. Combined, the bulk of the shows, it's double what it was before. I'm John Furrier, co-founder of SiliconANGLE. Stu Miniman, analyst, here. Next is Gabe Monroy who is the lead product manager for containers for Microsoft Azure, Gabe welcome to the Cube. >> Thanks, glad to be here. Big fan of the show. >> Great to have you on. I mean obviously container madness we've gotten past that now it's Kubernetes madness which really means that the evolution of the industry is really starting to get some clear lines of sight as a straight and narrow if you will people starting to see a path towards scale, developer acceleration, more developers coming in than ever before, this cloud native world. Microsoft's doing pretty well with the cloud right now. Numbers are great, hiring a bunch of people, give us a quick update big news what's going on? >> Yeah so you know a lot of things going on. I'm just excited to be here, I think for me, I'm new to Microsoft right. I came here about seven months ago by way of the Deis acquisition and I like to think of myself as kind of representing part of this new Microsoft trend. My career was built on open source. I started a company called Deis and we were focused on really Kubernetes based solutions and here at Microsoft I'm really doing a lot of the same thing but with Microsoft's Cloud as sort of the vehicle that we're trying to attract developers to. >> What news do you guys have here, some services? >> Yeah so we got a bunch of things, we're talking about so the first is something I'm especially excited about. So this is the virtual kubelet. Now, tell a little bit of story here, I think it's actually kind of fascinating, so back in July we launched this thing called Azure Container Instances and ACI was a first-of-its-kind service: containers in the cloud. Just run a container, runs in the cloud. It's micro-billed and it is invisible infrastructure, so part of the definition of serverless there. As part of that we want to make it clear that if you were going to do complex things with these containers you really need an orchestrator so we released this thing called the ACI Connector for Kubernetes along with it. And we were excited to see people just were so drawn to this idea of serverless Kubernetes, Kubernetes that you know didn't have any VMs associated with it and folks at hyper.sh, who have a similar service container offering, they took our code base and forked it and did a version of theirs and you know Brent and I were thinking together when we were like "oh man there's something here, we should explore this" and so we got some engineers together, we put a lot of work together and we announced now, this in conjunction with hyper and others, this virtual kubelet that bridges the world of Kubernetes with the world of these new serverless container runtimes like ACI. >> Okay, can you explain that a little bit. >> Sure. >> People have been coming in saying wait does serverless replace, how does it work, is Kubernetes underneath still?
>> Yeah so I think the best place to start is the definition of serverless and I think serverless is really the conflation of three things: it's invisible infrastructure, it is micro billing, and it is an event based programming model. It's sort of the classical definition right. Now what we did with ACI and serverless containers is we took that last one, the event based programming model, and we said look you don't need to do that. If you want to write a container, anything that runs in that container can work, not just functions and so that is I think a really important distinction that I believe it's really the best of serverless is you know that micro billing and invisible infrastructure. >> Well that's built in isn't it? >> Correct yeah. >> What are the biggest challenges of serverless because first of all its [Inaudible 00:03:58] in the mind of a developer who doesn't want to deal with plumbing. >> Yes. >> Meaning networking plumbing, storage, and a lot of the details around configuring, just program away, be creative, spend their time building. >> Yes. >> What is the big differences between that? What are the issues and challenges that service has for people adopting it or is it frictionless at this point? >> Well you know as far, I mean it depends on what you're talking about right. So I think you know for functions you know it's very simple to you know get a function service and add your functions and deploy functions and start chaining those together and people are seeing rapid adoption and that's progressing nicely but there's also a contingent of folks who are represented here at the show who are really interested in containers as the primitive and not functions right. Containers are inclusive of lots of things, functions being one of them, betting on containers as like the compute artifact is actually a lot more flexible and solves a lot more use cases. So we're making sure that we can streamline ease of use for that while also bringing the benefits of serverless, really the way I think of this is marrying our AKS, our Managed Kubernetes Service, with ACI, our you know serverless containers, so you can get to a place where you can have a Kubernetes environment that has no VMs associated with it like literally zero VMs, you'd scale the thing down to zero and when you want to run a pod or container you just pay for a few seconds of time and then you kill it and you stop paying for it right. >> Alright so talk about customers. >> Yep. >> What's the customer experience you guys are going after, did you have any beta customers, who's adopting your approach, and can you highlight some examples of some really cool ones? You don't have to name names or you can, anecdotal data will be good. >> Yeah well you know I think on the announcement blog post page we have a really great video of Siemens Healthineers, I believe is the name, but basically a health care company that is looking, that is using Kubernetes on Azure, AKS specifically, to disrupt the health care market and to benefit real people and you know to me I think it's important that we remember that we're deep in this technology right but at the end of the day this is about helping developers who are in turn helping real world people and I think that video is a good example of that. >> And what was their impact, speed? Speed of developers? >> Yeah, I mean I think it's really the main thing is agility right, people want to move faster right and so that's the main benefit that we hear.
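
For readers who want the mechanics behind the scale-to-zero pattern just described, here is a rough sketch of steering a pod onto a virtual-kubelet node with the Kubernetes Python client. The node selector and toleration key follow conventions the virtual kubelet project used around this time; treat the exact labels and taint key as assumptions that vary by provider and version.

```python
# Sketch: schedule a pod onto the virtual node that fronts a serverless
# container runtime such as ACI. Label and taint values are assumptions.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="burst-job"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="worker", image="busybox",
            command=["sh", "-c", "echo hello; sleep 30"])],
        # Steer the pod to the virtual-kubelet node...
        node_selector={"kubernetes.io/role": "agent", "type": "virtual-kubelet"},
        # ...and tolerate the taint that keeps ordinary pods off of it.
        tolerations=[client.V1Toleration(
            key="virtual-kubelet.io/provider", operator="Exists")],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

When the pod finishes and is deleted, nothing keeps running underneath it, which is where the pay-per-seconds billing comes from.
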
I think cost is obviously a concern for folks but I think in practice the people cost of operating some of these systems tends to be a lot higher than the infrastructure costs when you stack them up, so people are willing to pay a little bit of a premium to make it easier on people and we see that over and over again. >> Yeah Gabe, want you to speak to kind of the speed of a company the size of Microsoft. So you know the Deis acquisition of course was already focused on Kubernetes before inside of Microsoft, see I mean big cloud companies moving really fast on Kubernetes. I've heard complaints from customers like "I can't get a good roadmap because it's moving so fast". >> You know I would say that was one of the biggest surprises for me joining Microsoft, is just how fast things move inside of Azure in particular. And I think it's terrific you know. I think that there's a really good focus of making sure that we're meeting customers where they are and building solutions that meet the market but also just executing and delivering and doing that with speed. One of the things that is most interesting to me is like the geographic spread. Microsoft is in so many different regions more than any other cloud. Compliance certification, we take all that stuff really seriously and being able to do all those things, be the enterprise friendly cloud while also moving at this breakneck pace in terms of innovation, it's really spectacular to watch from the inside. >> A lot of people don't know that. When they think about Azure they think "oh they're copying Amazon" but Microsoft has tons of data centers. They've had browsers, they're all over the world, so it's not like they're foreign to region areas I mean they're everywhere. >> Microsoft is everywhere, and not only is it not foreign but I mean you got to remember Microsoft is an enterprise software company at its core. We know developers, that is what we do and going into cloud in this way is just it's extremely natural for us. And I think that the same can't really be said for everyone who's trying to move into cloud. Like we've got history of working with developers, building platforms, we've got an entire division devoted to developer tooling right. >> I want to ask you about two things that come up a lot, one is very trendy, one is kind of not so trendy but super important, one is AI. >> Yes. >> AI, which with software is going to impact and disrupt storage, and with virtual kubelets this is going to change the storage game, but it's going to enhance the machine learning and AI capability. The other one is data warehousing or data analytics. Two very important trends, one is certainly a driver for growth and has a lot of sex appeal as the AI machine learning but all the analytics being done on cloud whether it's an IOT device, this is like a nice use case for containers and orchestration. Your comment and reaction for those two trends. >> Yeah and you know I think that AI and deep learning generally is something that we see driving a ton of demand for container orchestration. I've worked with lots of customers including folks like OpenAI on their Kubernetes infrastructure running on Azure today. Something that Elon Musk actually proudly mentioned, that was a good moment for the containers (chuckling) >> Get a free Tesla. Brokerage some Teslas and get that new one, goes from 0 to 100 in 4.5 seconds. >> Right yeah. >> So you got a good customer, OpenAI, what was the impact of them? What was the big?
>> Well you know this is ultimately about empowering people, in this case they happen to be data scientists, to get their job done in a way where I mean I look at it as we're doing our jobs in the infrastructure space if the infrastructure disappears. The more conceptual overhead we're bringing to developers that means we're not doing our job. >> So question then specifically is deep learning in AI, is it enhanced by containers and Kubernetes? >> Absolutely. >> What order of magnitude? >> I don't know, but an order of magnitude enhancement, I would argue. >> Just underlining that the really important piece is we're talking about data here >> Yes. >> and one of the things we've been kind of trying to tackle the last couple years of containers is you know storage and that's carried over to Kubernetes, how's Microsoft involved? What's your you know prognosis as to where we go with cloud native storage? >> Yeah that's a fascinating question and I actually, so back in the early days when I was still contributing to Docker, I was one of the largest external contributors to the Docker Project earlier in my career. I actually wrote some of the storage stuff and so I've been going around since Docker's inception in 2013 saying don't run databases in containers. It's not cause you can't, right, you can, but just because you can doesn't mean you should (chuckling) >> Exactly. >> and I think that you know as somebody who has worked in my career on the operations side, things like an SLA mean a lot and so this leads me to another one of our announcements at the show which is the Open Service Broker for Azure. Now what we've done, thanks to the Cloud Foundry Foundation who basically took the service broker concept and spun it out, we now are able to take the world of Kubernetes and bridge it to the world of Azure services, data services being sort of some of the most interesting. Now the demo that I like to show for this is WordPress, which by the way sounds silly, but WordPress still powers tons of the web today. WordPress is a PHP application and a MySQL database. Well if you're going to run WordPress at scale, are you going to want to run that MySQL in a container? Probably not, you're probably going to want to use something like Azure database for MySQL which comes with an SLA, backup/restore, DR, ops team by Microsoft to manage the whole thing right. So but then the question is well I want to use Kubernetes right so how do I do that right, well with the Open Service Broker for Azure we actually shipped a helm chart. We can helm install Azure WordPress and it will install in Kubernetes the same way you would a container based system and behind the scenes it uses the broker to go spin up a Postgres, sorry a MySQL and dynamically attach it. Now the coolest thing to me about this yeah is the agility but I think that one of the underrated features is the security. The developer who does that doesn't ever touch credentials, the passwords are automatically generated and automatically injected into the application so you get to do things with rotations without ever touching the app. >> So we're a publisher, we use WordPress, we'd love, will this help us with scale if we did Azure? >> Absolutely. After this is over we'll go set it up. (laughing) >> I love WordPress but when it breaks down well this is the whole point of where auto scaling shows a little bit of its capabilities in the world, is that with PHP you'd like to have more instances >> Yeah. >> that would be a use case.
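
As a sketch of what the broker flow looks like from the Kubernetes side, the Service Catalog resource below asks a broker such as the Open Service Broker for Azure to provision a managed MySQL. The class and plan names are illustrative assumptions, not the exact catalog entries.

```python
# Sketch: create a Service Catalog ServiceInstance via the Kubernetes
# custom objects API. Class/plan names and parameters are assumptions.
from kubernetes import client, config

config.load_kube_config()

instance = {
    "apiVersion": "servicecatalog.k8s.io/v1beta1",
    "kind": "ServiceInstance",
    "metadata": {"name": "wordpress-mysql", "namespace": "default"},
    "spec": {
        "clusterServiceClassExternalName": "azure-mysql",  # assumed name
        "clusterServicePlanExternalName": "basic",         # assumed plan
        "parameters": {"location": "eastus"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="servicecatalog.k8s.io",
    version="v1beta1",
    namespace="default",
    plural="serviceinstances",
    body=instance,
)
```

A companion ServiceBinding would then land the generated credentials in a Kubernetes secret, which is how the developer never touches passwords in the flow Monroy describes.
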
Okay, Redshift in Amazon wasn't talked about much at re:Invent last week. We don't hear a lot of talk around the data warehouse which is a super important way to think about collecting data in cloud and is that going to be an enhanced feature because people want to do analytics. There's a huge analytics audience out there, they're moving off of Teradata. They're doing, you guys have a lot of analytics at Microsoft. They might have moved from Hadoop or Hive or somewhere else so there's a lot of analytics workloads that would be prime or at least potentially prime for Kubernetes. >> Yeah I think >> Or is that not fully integrated? >> No I think it's interesting, I mean for us we look at, I personally think using something like the service broker, Open Service Broker API to bridge to something like a data lake or some of these other Azure hosted services is probably the better way of doing that because if you're going to run it on containers, these massive data warehouses, yes you can do it, but the operational burden is high, >> So your point about the >> its really high. >> database earlier. >> Yeah. Same general point there. Now can you do it? Do we see people doing it? Absolutely right. >> Yeah, they do things sometimes that they shouldn't be doing. >> Yeah and of course back to the deep learning example those are typically big large training models that have similar characteristics. >> Alright as a newbie inside Azure, not new to the industry and the community, >> Yep. >> share some color. What's it like in there? Obviously a number two to Amazon, you guys have great geography presence, you're adding more and more services every day at Azure, what's the vibe, what's the mojo like over there, and share some inside baseball. >> Yeah I got to say so really I'm just saying it's a really exciting place to work. Things are moving so fast, we're growing so fast, customers really want what we're building. Honestly day to day I'm not spending a lot of time looking out I'm spending a lot of time dealing with enterprises who want to use our cloud products. >> And one of the top things that you have on your PM list, what are the top stack-ranked features people want? >> I think a lot of this comes down, in general I think this whole space is approaching a level of enterprise friendliness and enterprise hardening where we want to start adding governance, and adding security, and adding role based access controls across the board and really making this palatable to high trust environments. So I think that's a lot of our focus. >> Stability, ease of use. >> Stability, ease of use are always there. I think the enterprise hardening and things like v-net support for all of our services, v-net service endpoints, those are some things that are high on the list. >> Gabe Monroy, lead product manager for containers at Microsoft Azure Cloud. Great to have you on and love to talk more about geographies and moving apps around the network and multi-cloud but another time, thanks for the time. >> Another time. >> It's the Cube live coverage I'm John Furrier co-founder of [Inaudible 00:15:21]. Stu Miniman with Wikibon, back with more live coverage after this short break.

Published Date : Dec 7 2017


Nenshad Bardoliwalla & Pranav Rastogi | BigData NYC 2017


 

>> Announcer: Live from Midtown Manhattan it's theCUBE. Covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> OK, welcome back everyone we're here in New York City it's theCUBE's exclusive coverage of Big Data NYC, in conjunction with Strata Data going on right around the corner. It's our third day talking to all the influencers, CEO's, entrepreneurs, people making it happen in the Big Data world. I'm John Furrier co-host of theCUBE, with my co-host here Jim Kobielus who is the Lead Analyst at Wikibon Big Data. Nenshad Bardoliwalla. >> Bar-do-li-walla. >> Bardo. >> Nenshad Bardoliwalla. >> That guy. >> Okay, done. Of Paxata, Co-Founder & Chief Product Officer, it's a tongue twister, third day, being from Jersey, it's hard with our accent, but thanks for being patient with me. >> Happy to be here. >> Pranav Rastogi, Product Manager, Microsoft Azure. Guys, welcome back to theCUBE, good to see you. I apologize for that, third day blues here. So Paxata, we had your partner on, Prakash. >> Prakash. >> Prakash. Really a success story, you guys have done really well since launching, fun to watch you guys go from launch to success. Obviously your relationship with Microsoft super important. Talk about the relationship because I think this is really people can start connecting the dots. >> Sure, maybe I'll start and I'll be happy to get Pranav's point of view as well. Obviously Microsoft is one of the leading brands in the world and there are many aspects of the way that Microsoft has thought about their product development journey that have really been critical to the way that we have thought about Paxata as well. If you look at the number one tool that's used by analysts the world over it's Microsoft Excel. Right, there isn't even anything that's a close second. And if you look at the evolution of what Microsoft has done in many layers of the stack, whether it's the end user computing paradigm that Excel provides to the world. Whether it's all of their recent innovation in both hybrid cloud technologies as well as the big data technologies that Pranav is part of managing. We just see a very strong synergy between trying to combine the usage by business consumers of being able to take advantage of these big data technologies in a hybrid cloud environment. So there's a very natural resonance between the 2 companies. We're very privileged to have Microsoft Ventures as an investor in Paxata and so the opportunity for us to work with one of the great brands of all time in our industry was really a privilege for us. Yeah, and that's the corporate side, so that wasn't actually part of it. So it's a different part of Microsoft which is great. You also have business opportunity with them. >> Nenshad : We do. >> Obviously the data science problem that we're seeing is that they need to get the data faster. All that prep work, seems to be the big issue. >> It does and maybe we can get Pranav's point of view from the Microsoft angle. >> Yeah so to sort of continue what Nenshad was saying, you know data prep in general is sort of a key core competency which is problematic for lots of users, especially around the knowledge that you need to have in terms of the different tools you can use. Folks who are very proficient will do ETL or data preparation like scenarios using one of the computing engines like Hive or Spark.
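
As a rough illustration of the hand-written Spark prep Rastogi is referring to here, the kind a GUI-based tool wraps for the Excel audience, below is a minimal PySpark sketch; the paths and column names are placeholders.

```python
# A minimal, illustrative Spark data prep job: read raw data, clean it,
# land it in a columnar format for downstream engines. Paths and column
# names are placeholders, not a real pipeline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-prep").getOrCreate()

raw = spark.read.option("header", True).csv("wasb:///raw/customers.csv")

clean = (
    raw.dropDuplicates(["customer_id"])                       # de-dupe keys
       .withColumn("email", F.lower(F.trim(F.col("email"))))  # normalize text
       .filter(F.col("signup_date").isNotNull())              # drop bad rows
)

# Write the prepared data where downstream tools can pick it up.
clean.write.mode("overwrite").parquet("wasb:///curated/customers")
```

Writing and tuning jobs like this is exactly the proficiency barrier the conversation turns to next.
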
That's good, but there's this big audience out there who like an Excel-like interface, which is easy to use, a very visually rich graphical interface where you can drag and drop and can click through. And the idea behind all of this is how quickly can I get insights from my data faster. Because in a big data space, it's volume, variety and velocity. So data is coming at a very fast rate. It's changing, it's growing. And if you spend lot of time just doing data prep you're losing the value of data, or the value of data would change over time. So what we're trying to do with Paxata on HDInsight is enable these users to use Paxata, get insights from data faster by solving key problems of doing data prep. >> So data democracy is a term that we've been kicking around, you guys have been talking about as well. What does it actually mean, because we've been teasing it out the first two days here at theCUBE and BigData NYC. It's clear the community aspect of data is growing, almost on a similar path as you're seeing with open source software. That genie's out of the bottle. Open source software, tier one, it won, it's only growing exponentially. That same paradigm is moving into the data world where the collaboration is super important, in this data democracy, what does that actually mean and how does that relate to you guys? >> So the perspective we have is that first, something that one of our customers said, that is there is no democracy without certain degrees of governance. We all live in a democracy. And yet we still have rules that we have to abide by. There are still policies that society needs to follow in order for us to be successful citizens. So when a lot of folks hear the term democracy they really think of the wild wild west, you know. And a lot of the analytic work in the enterprise does have that flavor to it, right, people download stuff to their desktop, they do a little bit of massaging of the data. They email that to their friend, their friend then makes some changes and next thing you know we have what some folks affectionately call spreadmart hell. But if you really want to democratize the technology you have to wrap not only the user experience, like Pranav described, into something that's consumable by a very large number of folks in the enterprise. You have to wrap that with the governance and collaboration capabilities so that multiple people can work off the same data set. So that you can apply the permissions: who is allowed to share with each other, and under what circumstances are they allowed to share? Under what circumstances are you allowed to promote data from one environment to another? It may be okay for someone like me to work in a sandbox but I cannot push that to a database or HDFS or Azure BLOB storage unless I actually have the right permissions to do so. So I think what you're seeing is that, in general, technology always goes on this trend towards democratization. Whether it's the phone, whether it's the television, whether it's the personal computer and the same thing is happening with data technologies and certainly companies like. >> Well, Pranav, we were talking about this when you were on theCUBE yesterday. And I want to get your thoughts on this. The old way to solve the governance problem was to put data in silos. That was easy, I'll just put it in a silo and take care of it and access control was different.
But now the value of the data is about cross-pollinating and making it freely available, horizontally scalable, so that it can be used. But at the same time you need to have a new governance paradigm. So, you've got to democratize the data by making it available, addressable and usable for apps. At the same time there's also the concerns on how do you make sure it doesn't get in the wrong hands and so on and so forth. >> Yeah, and which is also very sort of common regarding open source projects in the cloud: how do you ensure that the user authorized to access this open source project, or run it, has the right credentials and is authorized and stuff. So, the benefit that you sort of get in the cloud is there's a centralized authentication system. There's Azure Active Directory, so you know most enterprises would have Active Directory users. Who are then authorized to either access maybe this cluster, or maybe this workload and they can run this job and that sort of goes further down to the data layer as well. Where we have access policies which then describe what user can access what files and what folders. So if you think about the entire scenario there is authentication and authorization happening across the system, for what user can access what data. And part of what Paxata brings in the picture is like how do you visualize this governance flow as data is coming from various sources, how do you make sure that the person who has access to data does have access to data, and the one who doesn't cannot access data. >> Is that the problem with data prep, is just that piece of it? What is the big problem with data prep, I mean, that seems to be, everyone keeps coming back to the same problem. What is causing all this data prep? >> People not buying Paxata, it's very simple. >> That's a good one. Check out Paxata they're going to solve your problems, go. But seriously, there seems to be the same hole people keep digging themselves into. They gather their stuff then next thing they're in the same hole they got to prepare all this stuff. >> I think the previous paradigms for doing data preparation tie exactly to the data democracy themes that we're talking about here. If you only have a very silo'd group of people in the organization with very deep technical skills but don't have the business context for what they're actually trying to accomplish, you have this impedance mismatch in the organization between the people who know what they want and the people who have the tools to do it. So what we've tried to do, and again you know taking a page out of the way that Microsoft has approached solving these problems you know both in the past and in the present. Is to say look we can actually take the tools that once were only in the hands of the, you know, shamans who know how to utter the right incantations and instead move that into the common folk who actually. >> The users. >> The users themselves who know what they want to do with the data. Who understand what those data elements mean. So if you were to ask the Paxata point of view, why have we had these data prep problems? Because we've separated the people who had the tools from the people who knew what they wanted to do with it. >> So it sounds to me, correct me if this is the wrong term, that what you offer in your partnership is basically a broad curational environment for knowledge workers.
You know, to sift and sort and annotate shared data with the lineage of the data preserved in essentially a system of record that can follow the data throughout its natural life. Is that a fair characterization? >> Pranav: I would think so yeah. >> You mention, Pranav, the whole issue of how one visualizes or should visualize this entire chain of custody, as it were, for the data, is there any special visualization paradigm that you guys offer? Now Microsoft, you've made a fairly significant investment in graph technology throughout your portfolio. I was at Build back in May and Satya and the others just went to town on all things to do with Microsoft Graph, will that technology be somehow at some point, now or in the future, be reflected in this overall capability that you've established here with your partner here Paxata? >> I am not sure. So far, I think what you've talked about is some Graph capabilities introduced from the Microsoft Graph, that's sort of one extreme. The other side of Graph exists today as a developer you can do some Graph based queries. So you can go to Cosmos DB, which has a Gremlin API for Graph based query, so I don't know how. >> I'll get right to the question. What are the Paxata benefits with HDInsight? How does that, just quickly, explain for the audience. What is that solution, what are the benefits? >> So the solution is you get a one-click install of Paxata on HDInsight, and the benefit is for a user persona who's not, sort of, used to big data or Hadoop they can use a very familiar GUI-based experience to get their insights from data faster without having any knowledge of how Spark works or Hadoop works. >> And what does the Microsoft relationship bring to the table for Paxata? >> So I think it's a couple of things. One is Azure is clearly growing at an extremely fast pace. And a lot of the enterprise customers that we work with are moving many of their workloads to Azure and these cloud based environments. Especially for us, the unique value proposition of a partner who truly understands the hybrid nature of the world. The idea that everything is going to move to the cloud or everything is going to stay on premise is too simplistic. Microsoft understood that from day one. That data would be in any and all of those different places. And they've provided enabling technologies for vendors like us. >> I'll just say it to maybe you're too coy to say it, but the bottom line is you have an Excel-like interface. They have Office 365, their users are going to instantly love that interface because it's an easy to use interface, an Excel-like, it's not the Excel interface per se. >> Similar. >> Metaphor, graphical user interface. >> Yes it is. >> It's clean and it's targeted at the analyst role or user. >> That's right. >> That's going to resonate in their install base. >> And combined with a lot of these new capabilities that Microsoft is rolling out from a big data perspective. So HDInsight has a very rich portfolio of runtime engines and capabilities. They're introducing new data storage layers whether it's ADLS or Azure BLOB storage, so it's really a nice way of us working together to extract and unlock a lot of the value that Microsoft. >> So, here's the tough question for you, open source projects, I see Microsoft, comments were hell froze over because Linux is now part of their DNA, which was a comment I saw at the event this week in Orlando, but they're really getting behind open source. From open compute, it's just clearly new DNA.
They're, they're into it. How are you guys working together in open source and what's the impact to developers, because now it's not only one cloud, there's other clouds out there so data's going to be an important part of it. So open source, together, you guys working together on that and what's the role for the data? >> From an open source perspective, Microsoft plays a big role in embracing open source technologies and making sure that they run reliably in the cloud. And part of that value prop that we provide in sort of Azure HDInsight is making sure that you can run these open source big data workloads reliably in the cloud. So you can run open source like Apache, Spark, Hive, Storm, Kafka, R Server. And the hard part about running open source technology in the cloud is how do you fine tune it, and how do you configure it, how do you run it reliably. And that's sort of what we bring in from a cloud perspective. And we also contribute back to the community based on sort of what we learned by running these workloads in the cloud. And we believe you know in the broader ecosystem customers will sort of have a mixture of these combinations in their solution. They'll be using some of the Microsoft solutions, some open source solutions, some solutions from the ecosystem, that's how we see our customers' solutions sort of being built today. >> What's the big advantage you guys have at Paxata? What's the key differentiator for why someone should work with you guys? Is it the automation? What's the key secret sauce to you guys? >> I think it's a couple of dimensions. One is I think we have come the closest in the industry to getting a user experience that matches the Excel target user. A lot of folks are attempting to do the same but the feedback we consistently get is that when the Excel user uses our solution they just, they get it. >> Was there a design criteria, was that from the beginning how you were going to do this? >> From day one. >> So you engineer everything to make it as simple as like Excel. >> We want people to use our system, they shouldn't be coding, they shouldn't be writing scripts. They just need to be able. >> Good Excel, you just do good macros though. >> That's right. >> So simple things like that right. >> But the second is being able to interact with the data at scale. There are a lot of solutions out there that make the mistake in our opinion of sampling very tiny amounts of data and then asking you to draw inferences and then publish that to batch jobs. Our whole approach is to smash the batch paradigm and actually bring as much into the interactive world as possible. So end users can actually point and click on 100 million rows of data, instead of the million that you would get in Excel, and get an instantaneous response. Versus designing a job in a batch paradigm and then pushing it through the batch. >> So it's interactive data profiling over vast corpuses of data in the cloud. >> Nenshad: Correct. >> Nenshad Bardoliwalla thanks for coming on theCUBE appreciate it, congratulations on Paxata and Microsoft Azure, great to have you. Good job on everything you do with Azure. I want to give you guys props, with seeing the growth in the market and the investment's been going well, congratulations. Thanks for sharing, keep it here for coverage in BigData NYC, more coming after this short break.

Published Date : Sep 28 2017


Amit Walia, Informatica | BigData NYC 2017


 

>> Announcer: Live from midtown Manhattan, it's theCUBE. Covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay welcome back everyone, live here in New York City it's theCUBE's coverage of Big Data NYC. It's our event we've been doing for five years in conjunction with Strata Hadoop, now called Strata Data, right around the corner, separate place. Every year we get the best voices in tech. Thought leaders, CEOs, executives, entrepreneurs, anyone who's bringing the signal, we share that with you. I'm John Furrier, the co-host of theCUBE. Eight years covering Big Data, since 2010, the original Hadoop world. I'm here with Amit Walia, who's the Executive Vice President, Chief Product Officer for Informatica. Welcome back, good to see you. >> Good to be here John. >> theCUBE alumni, always great to have you on. Love product, we had everyone on from Hortonworks. >> I just saw that. >> Product guys are great, can share the road map and kind of connect the dots. As Chief Product Officer, you have to have a 20-mile stare into the future. You got to know what the landscape is today, where it's going to be tomorrow. So I got to ask you, where's it going to be tomorrow? It seems that the rubber's hit the road, real value has to be produced. The hype of AI is out there, which I love by the way. People can see through that but they get it's good. Where's the value today? That's what customers want to know. I got hybrid cloud on the table, I got a lot of security concerns. Governance is a huge problem. The European regulations are coming over the top. I don't have time to do IoT and these other things, or do I? I mean this is a lot of challenges but how do you see it playing out? >> I think, to be candid, it's the best of times. The changing times are the best of times because people can experiment. I would say if you step back and take a look, we've been talking for such a long time. If there was any time, where, forget the technology jargon of infrastructure, cloud, IoT, data has become the currency for every enterprise, right? Everybody wants data. I say like you know, business users want today's data yesterday to make a decision tomorrow. IT has always been in the business of data, everybody wants more data. But the point you're making is that while that has become more relevant to an enterprise, it brings into the lot of other things, GDPR, it brings governance, it brings security issues, I mean hybrid clouds, some data on-prem, some data on cloud but in essence, what I think every company has realized is that they will live and die by how well they predict the future with the data they have on all their customers, products, whatever it is, and that's the new normal. >> Well hate to say it, but I'll pat myself on the back, we in theCUBE team and Wikibon saw this early. You guys did too, and I want to bring up a comment we talked about a couple of years ago. One, you guys were in the data business, Informatica. You guys went private but that was an early indicator of the trend that everyone's going private now. And that's a signal. For the first time, private equity financing has trumped bigger venture capital asset class financing. Which is a signal that the waves are coming. We're surfing these little waves right now, we think they're big but the big ones are coming. The indicator is everyone's retrenching. Private equity's a sign of undervaluation.
They want to actually also transform maybe some of the product engineering side of it, or go to market. Basically get the new surfboard. >> Yeah. >> For the big waves. >> I mean, that was the premise for us too, because we saw it as we were chatting, right. We knew the new world, which was going towards predictive analytics or AI. See, data is the richest thing for AI to be applied to, but the thing is that it requires some heavy lifting. In fact that was our thesis, that as we went private, look, we can double down on things like cloud. Invest truly for the next four years, which being in public markets sometimes is hard. So we stepped back and looked at where we are, as you were asking me earlier today. We're big believers, look, there's so much data, so many varying architectures, so many different places. People are in Azure, or AWS, on-prem, by the way, still on mainframe. That hasn't gone away, you go back to the large customers. But ultimately when you talk about the biggest, I would say the new normal, which is AI, which clearly has been overtalked about but in my opinion has been barely touched, because the biggest application of machine learning is on data. And that predicts things, whether you want to predict forecasting, or predict something coming down the line, and that's what we believe is where the world is going to go, and that's what we doubled down on with our Claire technology. Just go deep, bring AI to data across the enterprise. >> We got to give you guys props, you guys are right on the line. I got to say, as a product person myself, I see you guys executing a great strategy, you've been very complimentary to your team, I think you're doing a great job. Let's get back to AI. I think if you look at the hype cycles of things, IoT certainly has hype, and I still think there's a lot more hype to come there, there's so much more to do there. Cloud was overhyped, remember cloud washing? Pexus back in 2010-11, oh they're just cloud washing. Well, that's a sign that it ended up becoming what everyone was kind of hyping up. It did turn out. AI, I think, is the same thing. And I think it's real, because you can almost connect the dots and be there, but the reality is that it's just getting started. And so we had Rob Thomas from IBM on theCUBE and, you know, we were talking. He made a comment I want to get your reaction to, he said, "You can't have AI without IA." Information architecture. And you're in the information business, Informatica, you guys have been laying out an architecture specifically around governance. You guys kind of saw that early too. You can't just do AI, AI needs to be trained with data models. There's a lot of data involved that feeds AI. Who trains the machines that are doing the learning? So, you know, all these things come into play back to data. So what is the preferred information architecture, IA, that can power AI, artificial intelligence? >> I think it's a great question. I'll tell you what typically we recommend, and what we see large companies do. Look at the current complex architectures the companies are in. Hybrid cloud, multicloud, old architectures. By the way mainframe, client server, big data, you pick your favorite architecture, everything exists in any enterprise, right. Companies are not going to magically move everything to one place, to just start putting data in one place and start running some kind of AI on it. Our belief is that that will get organized around metadata. Metadata is data about data, right?
The organizing principle for any enterprise has to be around metadata. Leave your data wherever it is, organize your metadata, which is a much lighter footprint, and then that layer becomes the true central nervous system for your new next gen information architecture. That's the layer on which you apply machine learning too. So a great example is, look, take GDPR. I mean GDPR is, if I may, a disruptor; large companies have their GDPR challenges. I mean, who's touching my data? Where is my data coming from? Which database has sensitive data? All of these things are such complex problems. You will not move everything magically to one place. You will apply a metadata approach to it, and then machine learning starts telling you, gee, I see some anomaly. You see, I'm seeing some data which does not have permission to leave the geographical boundaries of, let's say, Germany, going to, let's say, the UK. Those are the kind of things that become a lot easier to solve once you organize yourself at the metadata layer, and that's the layer on which you apply AI. To me, that's the simplest way to describe the organizing principle of what I call the data architecture or the information architecture for the next ten years. >> And that metadata, you guys saw that earlier, but how does that relate to these new things coming in? Because, you know, one would argue that the ideal preferred infrastructure would be one that says, hey, no matter what next GDPR thing will happen, there'll be another Equifax that's going to happen, there'll be some sort of state-sponsored cyber attack on the US, all these things are happening. I mean hell, all security attacks are going up-- >> Security's a great example of that. We saw it four years ago, you know, and we worked on a metadata-driven approach to security. Look, I've been in the security business, over at Symantec, myself. Security's a classic example of where it was all at the infrastructure layer: network, database, server. But the problem is that it doesn't matter. Where is your database? In the cloud. Where is your network? I mean, do you run a data center anymore, right? If I may, figuratively you don't. Ultimately, it's all about the data. The way the world is going, we want more users like you and me to have access to data. So security has to be applied at the data layer. So in that context, I just talked about the whole metadata-driven approach. Once you have the context of your data, you can apply governance to your data, you can apply security to your data, and as you keep adding new architectures, you do not have to create a parallel architecture, you just have to append your metadata. So security, governance, hybrid cloud, all of those things become a lot easier for you, versus creating one new architecture after another, which you can never get to. >> Well, people will be afraid of malware and these malicious attacks, so auditing becomes a big thing now. If you look at Equifax, I have some data on that showing there was other action; they were fleeced for weeks and months before the hack was even noticed. >> All this happens. >> I mean, they were phished over ten times even before it was discovered. They were inside, so the audit trail would be interesting. >> Absolutely. I'll tell you, typically, if you read any external report, and this is nothing tied to Equifax, it takes any enterprise three months minimum to figure out they're under attack.
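
To make the metadata-layer idea above concrete, here is a minimal sketch of the kind of residency check described: a toy catalog of dataset tags and a stream of observed flow events. Every name in it is hypothetical, invented for illustration; nothing here is Informatica's actual API.

```python
# A minimal, hypothetical sketch of a metadata-driven residency check.
# The rule lives with the metadata tags, not with any one database.

RESIDENCY_RULES = {
    # metadata tag -> regions that data carrying the tag may move to
    "pii_de": {"DE"},                 # German PII must stay in Germany
    "eu_general": {"DE", "FR", "UK"},
}

def residency_violations(catalog, flow_events):
    """catalog maps dataset -> {"tags": [...]}; flow_events are
    (dataset, source_region, dest_region) observations."""
    violations = []
    for dataset, src, dest in flow_events:
        for tag in catalog.get(dataset, {}).get("tags", []):
            allowed = RESIDENCY_RULES.get(tag)
            if allowed is not None and dest not in allowed:
                violations.append((dataset, src, dest, tag))
    return violations

# A German PII dataset observed flowing to the UK gets flagged;
# the same data moving inside Germany does not.
catalog = {"crm_customers": {"tags": ["pii_de"]}}
flows = [("crm_customers", "DE", "UK"), ("crm_customers", "DE", "DE")]
print(residency_violations(catalog, flows))
# [('crm_customers', 'DE', 'UK', 'pii_de')]
```

The point of the sketch is the shape of the approach: the rule attaches to metadata tags rather than to a store, so a new data store only has to be cataloged and tagged, not re-architected.
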
And a sophisticated attacker, right away when they enter your enterprise, goes to find the weakest link. You're as secure as your weakest link in security. And they will go to some data trail that was left behind by some business user who moved on to the next big thing. But data was still flowing through that pipe. Or, by the way, the biggest issue is the insider attack, right? You will have somebody hack your or my credentials, and they don't download, like Snowden, a big fat document one day. They'll go drip by drip by drip by drip. You won't even know that. That again is an anomaly detection thing. >> Well, it's going to get down to the firmware level. I mean, look at the sophisticated hacks in China, they run their own DNS. They have certificates, they hack the iPhones. They make the phones and stuff, so you got to assume hacking. But now, it's knowing what's going on, and this is really the dynamic nature. So we're on the same page here. I'd love to do a security feature, come into the studio in our office in Palo Alto, I think that's worthy. I just had a great cyber chat with the CTO of Vidder. Junaid is awesome, did some work with the government. But this brings up the question around big data. The landscape that we're in is fast and furious right now. You have big data being impacted by cloud, because you now have unlimited compute, low latency storage, an unlimited power source in that engine. Then you got the security paradigm. You could argue that that's going to slow things down maybe a little bit, but it also is going to change the face of big data. What is your reaction to the impact of security and cloud on big data? Because even though AI is the big talk of the show, what's really happening here at Strata Data is it's no longer a data show, it's a cloud and security show, in my opinion. >> I mean, cloud to me is everywhere. When Hadoop started it was on-prem, but it's pretty much in the cloud now, and look at AWS and Azure, everyone runs natively there, so you're exactly right. To me, what has happened is that, you're right, companies look at things two ways. If I'm experimenting, then I can look at it in a way where I'm in dev mode. But you're right. As things are getting more operational and production, then you have to worry about security and governance. So I don't think it's a matter of slowing down, it's the nature of the business, where you can be fast and experiment on one side, but as you go prod, as you go real operational, you have to worry about controls, compliance and governance. By the way, in that case-- >> And by the way, you got to know what's going on, you got to know the flows. A data lake is a data lake, but you got the Niagara Falls >> That's right. >> streaming content. >> Every customer of ours who's gone to production, they always want to understand full governance and lineage in the data flow. Because when I go talk to a regulator, or I go talk to my CEO, you may have a hundred people going at the data lake. I want to know who has access to it, if it's a production data lake, what are they doing, and by the way, what data is going in. The other one is, I mean, walk around here. How much has changed? The world of big data was the wild wild west. Look at the amount of consolidation that has happened. I mean, you see it around the big distributions, right? To me it's going to continue to happen, because it's the nature of any new industry.
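
The "drip by drip" insider pattern described a few lines up evades simple rules precisely because each transfer is individually unremarkable; only an aggregate view catches it. Here is a minimal sketch of that aggregate view, with event shapes and thresholds invented purely for illustration:

```python
# Hypothetical sketch of slow-drip exfiltration detection: flag a user whose
# small daily transfers are individually normal but cumulatively anomalous.

from collections import defaultdict

def drip_anomalies(transfers, daily_cap_mb=50, window_days=30, window_cap_mb=300):
    """transfers: iterable of (user, day_index, megabytes).

    A classic rule only catches single transfers over daily_cap_mb;
    this also sums each user's trailing window to catch slow leaks.
    """
    per_user_day = defaultdict(lambda: defaultdict(float))
    for user, day, mb in transfers:
        per_user_day[user][day] += mb

    flagged = set()
    for user, days in per_user_day.items():
        for day, mb in days.items():
            if mb > daily_cap_mb:
                flagged.add((user, day, "single-day spike"))
        last = max(days)
        window_total = sum(mb for d, mb in days.items() if d > last - window_days)
        if window_total > window_cap_mb:
            flagged.add((user, last, "slow-drip total"))
    return flagged

# 29 days of 15 MB never trips the daily rule, but the window rule catches it.
events = [("jdoe", d, 15.0) for d in range(29)]
print(drip_anomalies(events))  # {('jdoe', 28, 'slow-drip total')}
```
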
I mean, you look at security, cyber security, big data, AI, you know, massive investment happens, and then as customers want to truly go to scale they say, look, I can only bet on a few that can not only scale, but have the governance and compliance that a large company wants. >> The waves are coming, there's no doubt about it. Okay so, let me get your reaction to end this segment. What's Informatica doing right now? I mean, I've seen a whole lot 'cause we've covered you guys with the show and also we keep in touch, but I want you to spend a minute to talk about why you guys are better than what's out there on the floor. You have a different approach. Why are customers working with you, and if the folks aren't working with you yet, why should they work with Informatica? >> Our approach in a way has changed but not changed. We believe we operate in what we call enterprise cloud data management. Our thing is, look, we embrace open source. Open source, Spark, Spark Streaming, Kafka, you know, Hive, MapReduce, we support them all. To us, that's not where customers are spending their time. They're spending their time on, once I got all that stuff, what can I do with it? If I'm truly building a next gen predictive analytics platform, I need some ability to manage batch and streaming together. I want to make sure that it can scale. I want to make sure it has security, it has governance, it has compliance. So customers work with us to make sure that they can run a hybrid architecture, whether it is cloud or on-prem, whether it is traditional or big data or IoT, all in one place, that is scalable and has governance and compliance baked into it. And then they also look for somebody that can provide, not only data integration, but quality, cataloging, all of those things. So when we're working with large or small customers, whether you are in dev or prod, we're ultimately helping you with what I call taking you from an experiment stage to a large scale operational stage, you know, without batting an eyelid. That's the business we are in, and in that case-- >> So you are in the business of operationalizing data for customers who want to operate at scale. >> Our belief is, we want to help our customers succeed. And customers will only succeed not just by experimenting, but by taking their experiments to production. So we have to think of the entire lifecycle of a customer. We cannot stop and say, great for experiments, sorry, don't go operational with us. >> So we've had a theme here in theCUBE this week, I'm calling it, don't be a tool; too many tools are out there right now. We call it the tool shed phenomenon. The tool shed phenomenon is, customers are tired of having too many tools, and they bought a hammer a couple years ago that wants to try to be a lawn mower now. So you got to understand the nature of having great tooling, which you need, which defines the work, but don't confuse a tool with a platform. And this is a huge issue, because a lot of these companies that are falling by the wayside are groping for platforms. >> So there are customers that tell us the same thing, which is why we-- >> But tools have to work in context. >> That's exactly right, so that's why you heard us talk about that for the last couple of years: the intelligent data platform. Customers don't buy a platform, but all of our products are like microservices on our platform. Customers want to build the next gen data management platform, which is the intelligent data platform.
A lot of little things are features or tools along the way, but if I am a large bank, if I'm a large airline, and I want to go at scale, operational, I can't stitch together a hundred tools and expect to run my IT shop from there. >> Yeah >> I can't, I will never be able to do it. >> There's good tools out there that have a nice business model, a lifestyle business or cashflow business, or even tools that are just highly focused, and that's all they do, and that's great. It's the guys who try to become something that they're not. It's hard, it's just too difficult. >> I think you have to-- >> The tool shed phenomenon is real. >> I think companies have to realize whether they are a feature. I always say, are you a feature or are you a product? You have to realize the difference between the two, and in between sits a tool. (John laughing) >> Well that quote came, the tool comment came from one of the chief data officers we talked to, that kind of sparked the conversation. But people buy a hammer, everything looks like a nail, and you don't want to mow your lawn with a hammer, get a lawn mower, right? Use the right tool for the job. But you have to have a platform; the data has to have a holistic view. >> That's exactly right. The intelligent data platform, that's what we call it. >> What's new with Informatica, what's going on? Give us a quick update, we'll end the segment with a quick update on Informatica. What do you got going on, what events are coming up? >> Well, we just came off a very big release, we call it 10.2, which had a lot of big data, hybrid cloud, AI and catalog and security and governance, all five of them. Big release, just came out, and basically customers are adopting it. Which obviously was all centered around the things we talked about at Informatica. Again, single platform: cloud, hybrid, big data, streaming, and governance and compliance. And then right now, we are basically in the middle of, after Informatica World, we go on a barrage of tours across multiple cities across the globe, so customers can meet us there. Paris is coming up, I was in London a few weeks ago. And then separately we're gearing up for what's coming up; I will probably see you there at Amazon re:Invent. I mean, we are obviously an all-in partner for-- >> Do you have anything in China? >> China is a- >> Alibaba? >> We're working with them, I'll leave it there. >> We'll be in Alibaba in two weeks for their cloud event. >> Excellent. >> So theCUBE is breaking into China, CUBE China. We need some translators, so if anyone out there wants to help us with our China blog. >> We'll be at Dreamforce. We were obviously, so you'll see us there. We were at Amazon Ignite, obviously very close to- >> re:Invent will be great. >> Yeah, we will be there, and Amazon obviously is a great partner and, by the way, a great customer of ours. >> Well congratulations, you guys are doing great, Informatica. Great to see the success. We'll see you at re:Invent, and keep in touch. Amit Walia, the Executive Vice President, EVP, Chief Product Officer, Informatica. They get the platform game, they get the data game, check 'em out. It's theCUBE ending day two coverage. We've got a big event tonight. We're going to be streaming live our research that we are going to be rolling out here at Big Data NYC, our event that we're running in conjunction with Strata Data. They run their event, we run our event. Thanks for watching and stay tuned, stay with us.
At five o'clock, live Wikibon coverage of their new research and then Party at Seven, which will not be filmed, that's when we're going to have some cocktails. I'm John Furrier, thanks for watching. Stay tuned. (techno music)

Published Date : Sep 28 2017



Yaron Haviv, iguazio | BigData NYC 2017


 

>> Announcer: Live from midtown Manhattan, it's theCUBE, covering BigData New York City 2017, brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay, welcome back everyone, we're live in New York City, this is theCUBE's coverage of BigData NYC, this is our own event, for five years now we've been running it, been at Hadoop World since 2010, it's our eighth year covering the Hadoop World, which has evolved into Strata Conference, Strata Hadoop, now called Strata Data, and of course it's bigger than just Strata, it's about big data in NYC, a lot of big players here inside theCUBE, thought leaders, entrepreneurs, and great guests. I'm John Furrier, the cohost this week with Jim Kobielus, who's the lead analyst on big data on our Wikibon team. Our next guest is Yaron Haviv, who's with iguazio, he's the founder and CTO, hot startup here at the show, making a lot of waves on their new platform. Welcome to theCUBE, good to see you again, congratulations. >> Yes, thanks, thanks very much. We're happy to be here again. >> You're known in theCUBE community as the guy on Twitter who's always pinging me and Dave and team, saying, "Hey, you know, you guys got to get that right." You really are one of the smartest guys on the network in our community, you're super-smart, your team has got great tech chops, and in the middle of all that is the hottest market, which is cloud native. Cloud native as it relates to the integration of how apps are being built, and essentially new ways of engineering around these solutions, not just repackaging old stuff; it's really about putting things in a true cloud environment, with application development, with data at the center of it. You've got a whole complex platform you've introduced. So really, really want to dig into this. So before we get into some of my pointed questions, and I know Jim's got a ton of questions, give us an update on what's going on. You guys got some news here at the show, let's get to that first. >> So since the last time we spoke, we had tons of news. We're making revenues, we have customers, we've just recently GA'ed, we recently got significant investment from major investors, we raised about $33 million recently from companies like Verizon Ventures, Bosch, you know, for IoT, Chicago Mercantile Exchange, which is Dow Jones and other properties, Dell EMC. So pretty broad. >> John: So customers, pretty much. >> Yeah, so that's the interesting thing. Usually, you know, investors are sort of strategic investors or partners or potential buyers, but here it's essentially our customers, because it's so strategic to the business, they want to... >> Let's go with the GA of the projects, just get into what's shipping, what's available. What's the general availability, what are you now offering? >> So iguazio is trying to, you know, you alluded to cloud native and all that. Usually when you go to events like Strata and BigData, it's nothing to do with cloud native. A lot of hard labor, not really continuous development and integration, it's like continuous hard work. And essentially what we did, we created a data platform which is extremely fast and integrated, you know, has all the different forms of state, streaming and events and documents and tables and all that, in a very unique architecture, won't dive into that today. And on top of it we've integrated cloud services like Kubernetes and serverless functionality and others, so we can essentially create a hybrid cloud.
So some of our customers even deploy portions as OpEx-based settings in the cloud, and some portions at the edge or in the enterprise as deployed software, or even a prepackaged appliance. So we're the only ones that provide a full hybrid experience. >> John: Is this a SaaS product? >> So it's a software stack, and it could be delivered in three different options. One, if you don't want to mess with the hardware, you can just rent it, and it's deployed in an Equinix facility; we have very strong partnerships with them globally. If you want to have something on-prem, you can get a software reference architecture, you go and deploy it. If you're a telco or an IoT player that wants it in a manufacturing facility, we have a very small 2U box, four servers, four GPUs, all the analytics tech you could think of. You just put it in the factory instead of, like, two racks of Hadoop. >> So you're not general purpose, you're just, whatever the customer wants to deploy the stack, the flexibility is on them. >> Yeah. Now it is an appliance >> You have a hosting solution? >> It is an appliance even when you deploy it on-prem; it's a bunch of Docker containers inside, you don't even touch them, you don't SSH to the machine. You have APIs and you have UIs, and just like the cloud experience when you go to Amazon, you don't open the kimono, you know, you just use it. So that's the experience we're telling customers about. No root access problems, no security problems. It's a hardened system. Give us servers, we'll deploy it, and you go through consoles and UIs, >> You don't host anything for anyone? >> We host for some customers, including >> So you do whatever the customer is interested in doing? >> Yes. (laughs) >> So you're flexible, okay. >> We just want to make money. >> You're pretty good, sticking to the product. So on the GA, here essentially in the big data world, you mentioned that there's a data layer, like a data piece. So I got to ask you the question, so pretend I'm an idiot for a second, right. >> Yaron: Okay. >> Okay, yeah. >> No, you're a smart guy. >> What problem are you solving? So we'll just go to the simple. I love what you're doing, I assume you guys are super-smart, which I can say you are, but what's the problem you're solving, what's in it for me? >> Okay, so there are two problems. One is the challenge, everyone wants to transform. You know, there is this digital transformation mantra. And it means essentially two things. One is, I want to automate my operations environment so I can cut costs and be more competitive. The other one is, I want to improve my customer engagement. You know, I want to do mobile apps which are smarter, you know, get more direct content to the user, get more targeted functionality, et cetera. These are the two key challenges for every business, any industry, okay? So they go and they deploy Hadoop and Hive and all that stuff, and it takes them two years to productize it. And then they get to the data science bit. And by the time they finish, they understand that this Hadoop thing can only do one thing. It's queries, and reporting and BI, and data warehousing. How do you do actionable insights from that stuff, okay? 'Cause actionable insights means I get information from the mobile app, and then I translate it into some action. I have to enrich the vectors, the machine learning, all those details. And then I need to respond. Hadoop doesn't know how to do it.
So the first generation is people that pulled a lot of stuff into a data lake, and started querying it and generating reports. And the boss said >> Low cost data lake, basically, is what you're saying. >> Yes, and the boss said, "Okay, what are we going to do with this report? Is it generating any revenue for the business?" No. The only revenue generation is if you take this data >> You're fired, exactly. >> No, not all fired, but now >> John: Look at the budget >> Now they're starting to buy our stuff. So now the point is, okay, how can I put in all this data, and at the same time generate actions, and also deal with the production aspects of, I want to develop in a beta phase, I want to promote it into production. That's cloud native architectures, okay? Hadoop is not cloud. How do I take Spark, Zeppelin, you know, a notebook, and turn it into production? There's no way to do that. >> By the way, depending on which cloud you go to, they have a different mechanism and elements for each cloud. >> Yeah, so the cloud providers do address that because they are selling the package, >> It spans all the clouds, yeah. >> Yeah, so cloud providers are starting to have their own offerings, which are all proprietary, around: this is how you would, you know, forget about HDFS, we'll have S3, and we'll have Redshift for you, and we'll have Athena, and again you're starting to consume that as a service. Still doesn't address the continuous analytics challenge that people have. And if you're looking at what we've done with Grab, which is amazing, they started with using Amazon services, S3, Redshift, you know, Kinesis, all that stuff, and it took them about two hours to generate the insights. Now the problem is they want to do driver incentives in real time. So they want to incent the driver to go and make more rides or other things, so they have to analyze the event of the location of the driver, the event of the location of the customers, and just throw messages back based on analytics. So that's real time analytics, and that's not something that you can do >> They got to build that from scratch right away. I mean, they can't do that with the existing. >> No, and Uber invested tons of energy around that and they don't get the same functionality. Another unique feature that we talk about in our PR >> This is for the use case you're talking about, this is the Grab, which is the car >> Grab is the number one ride-sharing company in Asia, which is bigger than Uber in Asia, and they're using our platform. By the way, even Uber doesn't really use Hadoop, they use MemSQL for that stuff, so it's not really using open source and all that. But the point is, for example, with Uber, when they monetize the rides, they do it just based on demand, okay. And with Grab, now what they do, because of the capability that we can intersect tons of data in real time, they can also look at the weather, was there a terror attack or something like that. They don't want to raise the price >> A lot of other data points, could be traffic >> They don't want to raise the price if there was a problem, you know, and all the customers get aggravated. This is actually intersecting data in real time, and no one today can do that in real time beyond what we can do. >> A lot of people have semantic problems with real time, they don't even know what they mean by real time. >> Yaron: Yes. >> The data could be a week old, but they can get it to them in real time.
>> But every decision, if you think, if you generalize around the problem, okay, and we have slides on that that I explain to customers: every time I run analytics, I need to look at four types of data. The context, the event, okay, what happened, okay. The second type of data is the previous state. Like I have a car, was it up or down, or what's the previous state of that element? The third element is the time aggregation, like what happened in the last hour, the average temperature, the average, you know, ticker price for the stock, et cetera, okay? And the fourth thing is enriched data, like I have a car ID, but what's the make, what's the model, who's driving it right now. That's secondary data. So every time I run a machine learning task or any decision, I have to collect all those four types of data into one vector, it's called a feature vector, and take a decision on that. You take Kafka, it's only the event part, okay; you take MemSQL, it's only the state part; you take Hadoop, it's only like historical stuff. How do you assemble and stitch a feature vector? >> Well, you talked about a complex machine learning pipeline, so clearly you're talking about a hybrid >> It's a prediction. And actions based on just dumb things, like the car broke and I need to send a garage, I don't need machine learning for that. >> So within your environment then, do you enable the machine learning models to execute across the different data platforms of which this hybrid environment is composed, and then do you aggregate the results of those models' runs into some larger model that drives the real time decision? >> In our solution, everything is a document, so even a picture is a document, a lot of things. So you can essentially throw in a picture, run TensorFlow, embed more features into the document, and then query those features on another platform. So that's really what makes this continuous analytics extremely flexible, so that's what we give customers. The first thing is simplicity. They can now build applications; you know, we have a tier one automotive customer now, the CIO coming, meeting us. So, you know, when I have a project, it's one year, I need to hire dozens of people, it's hugely complex, you know. Tell us what's the use case, and we'll build a prototype. >> John: All right, well I'm going to >> One week, we gave them a prototype, and he was amazed how in one week we created an application that analyzed all the streams from the data from the cars, did enrichment, did machine learning, and provided predictions. >> Well, we're going to have to come in and test you on this, because I'm skeptical, but here's why. >> Everyone is. >> We'll get to that, I mean, I'm probably not skeptical, but I kind of am because the history is pretty clear. If you look at some of the big ideas out there, like OpenStack. I mean, that thing just morphed into a beast. Hadoop was a cost of ownership nightmare, as you mentioned early on. So people have been conceptually correct on what they were trying to do, but trying to get it done was always hard, and then it took a long time to kind of figure out the operational model. So how are you different, if I'm going to play the skeptic here? You know, I've heard this before. How are you different than, say, OpenStack or Hadoop clusters, 'cause that was a nightmare, cost of ownership, I couldn't get the type of value I needed, lost my budget. Why aren't you the same? >> Okay, that's interesting.
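
Haviv's four data types map cleanly onto a small sketch. The stores below are plain dicts standing in for a stream, a key-value state store, a time-series aggregate, and a reference table; all names and field shapes are invented purely for illustration.

```python
# Hypothetical sketch of assembling the four data types into one feature
# vector: event/context, previous state, time aggregation, and enrichment.

state_store = {"car_42": {"status": "up"}}                 # previous state
hourly_agg = {"car_42": {"avg_temp_1h": 91.4}}             # time aggregation
reference = {"car_42": {"make": "Acme", "model": "Z3"}}    # enrichment

def build_feature_vector(event):
    """event: e.g. {"entity": "car_42", "temp": 104.0} (the context/event)."""
    entity = event["entity"]
    return {
        "temp_now": event["temp"],
        "status": state_store.get(entity, {}).get("status", "unknown"),
        "avg_temp_1h": hourly_agg.get(entity, {}).get("avg_temp_1h", 0.0),
        "model": reference.get(entity, {}).get("model", "unknown"),
    }

# One lookup per data type, stitched the moment the event arrives; this is
# the step Haviv argues no single-purpose system performs on its own.
print(build_feature_vector({"entity": "car_42", "temp": 104.0}))
```
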
I don't know if you know, but I ran a lot of development for OpenStack when I was at Mellanox, and Hadoop, so I patched a lot of those >> So do you agree with what I said? That that was a problem? >> They are extremely complex, yes. And I think one of the things is that, first, OpenStack tried to bite off too much, and it's sort of a huge tent, everyone tries to push their agenda. OpenStack is still an infrastructure layer, okay. And also Hadoop is sort of something in between an infrastructure and an application layer, but it was designed 10 years ago, where the problem that Hadoop tried to solve is how do you do web ranking, okay, on tons of batch data. And then the ecosystem evolved into real time, and streaming and machine learning. >> A data warehousing alternative or whatever. >> So it doesn't fit the original model of batch processing, 'cause if an event comes from the car or an IoT device, and you have to do something with it, you need a table with an index. You can't just go and build a huge Parquet file. >> You know, you're talking about complexity >> John: That's why he's different. >> Go ahead. >> So what we've done with our team, after knowing OpenStack and all those >> John: All the scar tissue. >> And all the scar tissues. My role was also working with all the cloud service providers, so I know their internal architecture, and I worked on SAP HANA and Exadata and all those things, so we learned from the bad experiences and said, let's forget about the lower layers, which is what OpenStack is trying to provide, providing you infrastructure as a service. Let's focus on the application, and build from the application all the way down to the flash, and the CPU instruction set, and the adapters and the networking, okay. That's what's different. So what we provide is an application and service experience. We don't provide infrastructure. If you go buy VMware and Nutanix, all those offerings, you get infrastructure. Now you go and build, with a dozen dev ops guys, all the stack above. You go to Amazon, you get services. It's just they're not the most optimized in terms of the implementation, because they also have dozens of independent projects where each one takes a VM and starts writing something >> But they're still a good service, but you got to put it together. >> Yeah, right. But also the way they implement, because in order for them to scale, they have a common layer, they spawn VMs, and then they start to build up applications, so it's inefficient. And also a lot of it is built on a 10-year-old baseline architecture. We've designed it for a very modern architecture, it's all parallel CPUs with 30 cores, you know, flash and NVMe. And so we've avoided a lot of the hardware challenges, and serialization, and just provide an abstraction layer pretty much like a cloud on top. >> Now, in terms of abstraction layers in the cloud, they're efficient, and provide a simplification experience for developers. Serverless computing is up and coming, it's an important approach; of course we have the public clouds from AWS and Google and IBM and Microsoft. There's a growing range of serverless computing frameworks for prem-based deployment. I believe you are behind one. Can you talk about what you're doing at iguazio on serverless frameworks for on-prem or public? >> Yes, and for the first time I'm very active in the CNCF, the Cloud Native Computing Foundation.
I'm one of the authors of the serverless white paper, which tries to normalize the definitions of all the vendors and come up with a proposal for an interoperable standard. So I spent a lot of energy on that, 'cause we don't want to lock customers to an API. What's unique, by the way, about our solution: we don't have a single proprietary API. We just emulate all the other guys' stuff. We have all the Amazon APIs for data services, like Kinesis, Dynamo, S3, et cetera. We have the open source APIs, like Kafka. So also on the serverless, my agenda is trying to promote that if I'm writing to Azure or AWS or iguazio, I don't need to change my app. I can use any developer tools. So that's my effort there. And recently, a few weeks ago, we launched our open source project, which is a sort of second generation of something we had before, called Nuclio. It's designed for real time >> John: How do you spell that? >> N-U-C-L-I-O. I even have the logo >> He's got a nice sticker here. >> It's really fast because it's >> John: Nuclio, so that's open source that you guys just sponsor and it's all code out in the open? >> All the code is in the open, pretty cool, has a lot of innovative ideas on how to do stream processing best, 'cause the original serverless functionality was designed around web hooks and HTTP, and even many of the open source projects are really designed around HTTP serving. >> I have a question. I'm doing research for Wikibon on the area of serverless, in fact we've recently published a report on serverless, and in terms of hybrid cloud environments, I'm not seeing yet any hybrid serverless clouds that involve public, you know, serverless like AWS Lambda, and private on-prem deployment of serverless. Do you have any customers who are doing that, or interested in hybridizing serverless across public and private? >> Of course, and we have some patents I don't want to go into, but the general idea is, what we've done in Nuclio is also the decoupling of the data from the computation, which means that things can sort of be disjoined. You can run a function on a Raspberry Pi, and the data will be in a different place, and those things can sort of move, okay. >> So the persistence has to happen outside the serverless environment, like in the application itself? >> Outside of the function; the function accesses the persistence layer through APIs, okay. And how this data persistence is materialized, that's a separate thing. So you can actually write the same function that will run against Kafka or Kinesis or a private MQ, or HTTP, without modifying the function, and ad hoc, through what we call function bindings, you define what's going to be the thing driving the data, or storing the data. So you can actually write the same function that does an ETL job from table one to table two. You don't need to put the table information in the function, which is not the thing that Lambda does. And it's about a hundred times faster than Lambda; we do 400,000 events per second in Nuclio. So if you write your serverless code in Nuclio, it's faster than writing it yourself, because of all those low-level optimizations. >> Yaron, thanks for coming on theCUBE. We want to do a deeper dive, love to have you out in Palo Alto next time you're in town. Let us know when you're in Silicon Valley for sure, we'll make sure we get you on camera for multiple sessions. >> And more information at re:Invent. >> Go to re:Invent. We're looking forward to seeing you there.
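
The function-binding idea above lends itself to a small sketch. The (context, event) handler shape below follows Nuclio's documented convention, but the binding map and helper functions are invented stand-ins, not the real Nuclio API; they only show how the same handler body stays ignorant of its source and sink.

```python
# Illustrative sketch of "function bindings": the handler never names its
# source or sink tables; a binding map supplies them at deploy time. The
# helpers below are assumptions for illustration, not real Nuclio calls.

BINDINGS = {
    "input": {"kind": "table", "name": "table_one"},
    "output": {"kind": "table", "name": "table_two"},
}

TABLES = {"table_one": [{"id": 1, "raw": " Hello "}], "table_two": []}

def read_binding(name):
    return TABLES[BINDINGS[name]["name"]]

def write_binding(name, row):
    TABLES[BINDINGS[name]["name"]].append(row)

def handler(context, event):
    # An ETL job from "input" to "output": the same code could run against
    # Kafka, Kinesis, or a table simply by swapping the BINDINGS map.
    for row in read_binding("input"):
        write_binding("output", {"id": row["id"], "clean": row["raw"].strip().lower()})
    return "ok"

handler(context=None, event=None)
print(TABLES["table_two"])  # [{'id': 1, 'clean': 'hello'}]
```
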
Love the continuous analytics message. I think continuous integration is going through a massive renaissance right now, you're starting to see new approaches, and I think the things that you're doing are exactly along the lines of what the world wants, which is alternatives, innovation, and thanks for sharing on theCUBE. >> Great. >> That's very great. >> This is theCUBE's coverage of the hot startups here at BigData NYC, live coverage from New York. I'm John Furrier, with Jim Kobielus. We'll be right back after this short break.

Published Date : Sep 27 2017



Jeff Veis, Actian | BigData NYC 2017


 

>> Live from Midtown Manhattan, it's theCUBE. Covering big data, New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay, welcome back everyone, live here in New York City, it's theCUBE's special annual presentation of BigData NYC. This is our annual event in New York City where we talk to all the thought leaders and experts, CEOs, executives, entrepreneurs, anyone making and shaping the agenda, with theCUBE. In conjunction with Strata Data, which was formerly called Strata Hadoop, and Hadoop World before that, right around the corner, a separate place; BigData NYC, theCUBE's event, we keep separate from that while we're here. With us is Jeff Veis, who's the chief marketing officer of Actian, a Cube alumni. Formerly with HPE, been on many times. Good to see you. >> Good to see you. >> Well, you're a marketing genius, we've talked before at HPE. You've got so much experience in data and analytics, you've seen the swath of the spectrum across the board, from what I call classic enterprise to cutting edge. To now full-on cloud, AI, machine learning, IoT. A lot of stuff going on; on-premise seems to be hot still. There's so much going on, from the large enterprises dealing with how to better use their analytics. At Actian you're heading up marketing, what's the positioning? What are you doing there? >> Well, the shift that we see, and what's unique about Actian, which has just a very differentiated and robust portfolio, is the shift to what we refer to as hybrid data. And it's a shift that people aren't talking about, most of the competition here. They have that next best mousetrap, that one thing. So it's either move your database to the cloud, or buy this appliance, or move to this piece of open source. And it's not that they don't have interesting technologies, but I think they're missing the key point. Which is, never before have we seen the creation side of data and the consumption of data becoming more diverse, more dynamic. >> And more in demand too, people want both sides. Before we go any deeper, I just want you to take a minute to define what hybrid data actually means. What does that term mean for the people that want to understand this term deeper? >> Well, it's understanding that it's not just the location of it. Of course there's hybrid computing, which is premise and cloud. And that's an important part of it. But it's also about where and how is that data created. What time domain is that data going to be consumed and used in, and that's so important. A lot of analytics, a lot of the guys across the street, are kind of thinking about reporting and analytics in that old world way of: we collect lots of data and then we deliver analytics. But increasingly, analytics is being used almost in real time or near real time. Because people are doing things with the data in the moment. Then another dimension of it is ad hoc discovery. Where you can have not one or two or three data scientists, but dozens if not hundreds of people, all with copies of Tableau and Qlik attacking and hitting that data. And of course it's not one data source but multiple, as they find adjacencies with data. A lot of the data may be outside of the four walls. So when you look at consumption and creation of data, the net net is you need not one solution but a collection of best fits. >> So a hybrid between consumption and creation, so that's the two hybrids. I mean, hybrid implies, you know, little bit of this, little bit of that. >> That's the bridge that you need to be able to cross. Which is, where do I get that data? And then where's that data going? >> Great, so let's get into Actian.
Give us the update, obviously Actian has got a huge portfolio. We've covered you guys, you know it best. Been on theCUBE many times. You've cobbled together all these solutions that can be very effective for customers. Take us through the value proposition that this hybrid data enables with Actian. >> Well, if you decompose it, from our viewpoint there's three pillars, that have kind of stood the test of time in one sense. They're critical: the ability to manage the data, the ability to connect the data. In the old days we said integrate, but now basically all apps, all kinds of data sources, are connected in some sense, sometimes very temporally. And then finally the analytics. So you need those three pillars, and you need to be able to orchestrate across them. And what we have is a collection of solutions that span that. They can do transactional data, they can do graph data and object oriented data. Today we're announcing a new generation of our analytics, specifically on Hadoop. And that's Vector H. I'd love to be able to talk to that today, with the native Spark integration. >> Let's get into the news. Hard news here at BigData NYC is you guys announced the latest support for Apache Spark with Vector H. So, Actian Vector in Hadoop, hence the H. What is it? >> Is Spark glue for hybrid data environments, or is it something you layer over different underlying databases? >> Well, I think it's fair to say it is becoming the glue. In fact, we had a previous technology that did a human's job at doing some of the work. Now we have Spark and that community. The thing though is, if you wanted to take advantage of Spark, it was kind of like the old days of Hadoop. Assembly was required, and that is increasingly not what organizations are looking for. They want to adopt the technology, but they want to use it and get on with their day job. What we have done... >> Machine learning, putting algorithms in place, managing software. >> It could be very exotic things such as predictive machine learning. Next generation AI. But for every one of those, there's easily a dozen if not a hundred uses of being able to reach in and extract data in its native formats. Being able to grab a Parquet file and, without any transformation, analyze it. Or being able to talk to an application and being able to interface with that. With being able to do reads and writes with zero penalty. So the ACID compliance component of databases is critical, and a lot of the traditional Hadoop approaches are pretty much read-only vehicles. And that meant they were limited in the use cases they could address. >> Let's talk about the hard news. What specifically was announced? >> Well, we have a technology called Vector. Vector, just to establish the baseline here, runs single node, Windows, Linux, and there's a community edition. So your users can download and use that right now. We have Vector H, which was designed for scale-out for Hadoop, and it takes advantage of YARN. And it allows you to scale out across your Hadoop cluster, petabytes if you like. What we've added to that solution is now native Spark integration, and that native Spark integration gives you three key things. Number one, zero penalty for real time updates. We're the only ones, to the best of our knowledge, that can do that. In other words, you can update the data and you will not slow down your analytics performance. Every other Hadoop-based analytic tool has to, if you will, stop the clock and flush out the new data to be able to do updates.
Because of our architecture and our deep knowledge of transactional processing, you don't slow down. That means you can always be assured you'll have fresh data running. The second thing is Spark-powered direct query access. So we can get at not just Vector formats; we have an optimized data format, which is the fastest you'd find in analytic databases, but what's so important is you can hit ORC, Parquet and other data file formats through Spark, without any transformation, be it to ingest or to analyze information. The third one, and certainly not the least, is something that I think you're going to be talking a lot more about. Which is native Spark data frame support. Data frames. >> What's the impact of that? >> Well, data frames will allow you to talk to Spark SQL and Spark R based applications. So now you're not just going to the data, you're going to other applications. And that means that you're able to interface directly to the system of record applications that are running, using this lingua franca of data frames, which now has hit a maturity point where you're seeing pretty broad adoption. And by doing native integration with that, we've just simplified the ability to connect directly to dozens of enterprise applications and get the information you need. >> Jeff, would you describe what you're offering now as a form of, sort of, a data virtualization layer that sits in front of all these back end databases, but uses data frames from Spark? Or am I misconstruing? >> Well, it's a little less a virtualization layer and maybe more a super highway. That we're able to say, this analytics tool... You know, in the old days it was one of two things. Either you had to do a formal traditional integration and transform that data, right? You had to go from French to German; once it was in German, you could read it. Or what you had to do was query and bring in that information, but you had to slow down your performance, because that transformation had not occurred. Now what we're able to do is use this Spark native connector. So you can have the best of both worlds; if you will, it is creating an abstraction layer, but it's really for connectivity, as opposed to an overall one. What we're not doing is virtualizing the data. That's the key point. There are some people that are pushing data cataloging and cleansing products and abstracting the entire data from you. You're still aware of where the native format is, you're still able to write to it with zero penalty. And that's critical for performance. When you start to build lots of abstraction layers, truly traditional ones, you simplify some things, but usually you pay a performance penalty. And just to make a point: in the benchmarks we're running, compared to Hive and Polor for example, use cases against Vector H that may take nearly two hours, we can do in less than two minutes. And we've been able to uphold that for over a year. That is because Vector, in its core technology, has columnar capabilities and, this is a mouthful, multi-level in-memory capability. And what does that mean, you ask? >> I was going to ask, but keep going. >> I can imagine the performance latency is probably great. I mean, you have in-memory, which everyone kind of wants. >> Well, a lot of in-memory, where it is used, is just held at the RAM level. It's the ability to bring data into RAM and take advantage of it. And we do that, and of course that's a positive, but we go down to the cache level.
We get down much, much lower, because we would rather that data be in the CPU if at all possible. And with these high performance cores, it's quite possible. So we have some tricks that are special and unique to Vector, so that we actually optimize the in-memory capability. The other last thing we do is, you know, Hadoop and HDFS are not particularly smart about where they place the data. And the last thing you want is your data rolling across lots of different data nodes. That just kills performance. What we're able to do is think about the co-location of the data. Look at the jobs, and look at the performance, and we're able to squeeze optimization in there. And that's how we're able to get 50, 100, sometimes in excess of 500 times faster than some of the other well known SQL-on-Hadoop performers. So that, combined now with this native Spark integration, means people don't have to do the plumbing; they can get out of the basement and up to the first floor. They can take advantage of open source innovation, yet get what we're claiming is the fastest analytics database in Hadoop. >> So, I got to ask you. I mean, you've been, as I mentioned in the intro, an industry veteran. CMO, chief marketing officer. I mean, challenging with Actian, 'cause there's so many things to focus on. How are you attacking the marketing of Actian? Because you have a portfolio, and hybrid data is a good position. I like how you bring that to the forefront, kind of give it a simple positioning. But as you look at Actian's value proposition and engage your customer base and potentially prospective customers, how are you iterating the marketing message, the positioning, and engaging with clients? >> Well, it's a fair question, and it is daunting when you have multiple products. And you got to have a simple compelling message; less is more, to get signal above noise today. At least that's how I feel. So we're hanging our hats on hybrid data. And we're going to take it to the moon or go down with the ship on that. But we've been getting some pretty good feedback. >> What's been the number one feedback on the hybrid data? Because I'm a big fan of hybrid cloud, but I've been saying it's a methodology, it's not a product. On-premise cloud is growing and so is public, so hybrid hangs together in the cloud thing. So with data, you're bridging two worlds. Consumption and creation. >> Well, what's interesting is, when you say hybrid data, people put their own definitions around it, in an unaided way, and they say, you know, with all the technology and all the trends, that's actually, at the end of the day, what nets out my situation. I do have data that's hybrid data, and it's becoming increasingly more hybrid. And god knows the people that are demanding and wanting to use it aren't letting up. And the last thing I need, and I'm really convinced of this: a lot of people talk about platforms, we love to use the P word, but nobody buys a platform, because people are trying to address their use cases. But they don't want to do it in this siloed kind of brick wall way, where I address one use case but it won't function elsewhere. What they're looking for is a collection of best-fit solutions that can cooperate together. The secret sauce for us is we have a cloud control plane. All our technologies, whether it's on-premise or in the cloud, touch that. And it allows us to orchestrate and do things together. Sometimes it's very intimate, and sometimes it's broader. >> Or what exactly is the control plane?
>> It does everything from administration, it can go down to billing, and it can also be scheduling transactional performance. Now on one extreme we use it for backup recovery for our transactional database. We have a cloud based backup recovery service, and it all gets administered through the control plane. So it knows exactly when it's appropriate to back up, because it understands that database and it takes care of it. It was relatively simple for us to create. On the more intimate sense, we were the first company, and it was called Actian X, which I know we were talking about before. We named our product after X before our friends at Apple did. So I like to think we were pioneers. >> San Francisco had the iPhone, don't get confused there, remember. >> I got to give credit where credit's due. >> And give it up. >> But what Actian X is, and we announced it back in April, is it takes the same Vector technology I just talked about, so it's material, and we combined it with our integrated transactional database, which has over 10,000 users around the world. And what we did is we dropped in this high performance columnar database for free. I'm going to say that again: for free, in our transactional platform. So every one of our customers, as soon as they upgraded to the new Actian X, got a rocket ship of a columnar high performance database inside their transactional database. The data is fresh, it moves over into the columnar format, and the reporting takes off. >> Jeff, to end the segment I'll give you the last word. A lot of people look at Actian, also a product I mentioned earlier. Is it product leadership that's winning, is it the values of the customer? Where is Actian winning, for the folks that aren't yet customers that you'd like to talk to? What is the Actian success formula? What's the differentiation, where is it, where does it jump off the page? Is it the product, is it the delivery? Where's the action? >> Is it innovation? >> Well, let me tell you, I would answer with two phrases. First is our tag line; our tag line is "activate your data". And that resonated with a lot of people. A lot of people have a lot of data, and we've been in this big data era where people talked about the size of their data. Literally, I have 5 petabytes, you have 6 petabytes. I think people realized that kind of missed the entire picture. Sometimes smaller data, god forbid 1 terabyte, can be amazingly powerful depending on the use case. So it's obviously more than size. What it is about is activating it. Are you actually using that data so it's making a meaningful difference? And you're not putting it in a data pond, puddle or lake to be used someday, like you're storing it in an attic. There's a lot of data getting dusty in attics today because it is not being activated. And that would bring me to, not the tag line, but what I think is driving us and why customers are considering us. They see we are about the technology of the future, but we're very much about innovation that actually works. Because of our heritage, because we have companies that understand for over 20 years how to run on data. We get what ACID compliance is, we get what transactional systems are. We get that you need to be able to not just read but write data. And we bring that methodology to our innovation, and so for people, companies, animals, any form of life that is interested. >> So it's the product platform that activates, and then the result is how you guys roll with customers.
>> In the real world today, where you can have real concurrency, real enterprise-grade performance, along with the innovation. >> And the hybrid gives them some flexibility; that's the new tag line, that's kind of the main point. If I understand you correctly, hybrid data means basically flexibility for the customer. >> Yeah, it's use the data you need for what you use it for, and have the systems work for you rather than you work for the systems. >> Okay, check out Actian. Jeff Veis, friend of theCUBE, alumni now, the CMO at Actian. We're following your progress, so congratulations on the new opportunity. More CUBE coverage after this short break. I'm John Furrier, with James Kobielus, here inside theCUBE in New York City for our Big Data NYC event all week, in conjunction with Strata Data right next door. We'll be right back. (tech music)
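To ground the Spark access pattern Veis describes above, here is a minimal PySpark sketch of querying Parquet and ORC files in place through Spark DataFrames and Spark SQL, with no upfront transformation. This is a generic illustration rather than Actian's actual connector; the paths, table names, and columns are all hypothetical.

    from pyspark.sql import SparkSession

    # Start a Spark session (assumes PySpark is installed locally).
    spark = SparkSession.builder.appName("direct-query-sketch").getOrCreate()

    # Read columnar files where they sit; no ETL or format conversion first.
    orders = spark.read.parquet("/data/orders")  # hypothetical path
    events = spark.read.orc("/data/events")      # hypothetical path

    # Register the DataFrames so SQL-centric tools can query them directly.
    orders.createOrReplaceTempView("orders")
    events.createOrReplaceTempView("events")

    # Join across the two file formats in a single query.
    top = spark.sql("""
        SELECT o.customer_id, COUNT(e.event_id) AS clicks
        FROM orders o JOIN events e ON o.customer_id = e.customer_id
        GROUP BY o.customer_id
        ORDER BY clicks DESC
        LIMIT 10
    """)
    top.show()

The point of the pattern is that the data stays in its native format on disk; the DataFrame is just the lingua franca that the engines and applications share.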

Published Date : Sep 27 2017


Itamar Ankorion, Attunity | BigData NYC 2017


 

>> Announcer: Live from Midtown Manhattan, it's theCUBE, covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsor. >> Okay, welcome back, everyone, to our live special CUBE coverage in New York City in Manhattan, we're here in Hell's Kitchen for theCUBE's exclusive coverage of our Big Data NYC event and Strata Data, which used to be called Strata Hadoop, used to be Hadoop World, but our event, Big Data NYC, is our fifth year where we gather every year to see what's going on in big data world and also produce all of our great research. I'm John Furrier, the co-host of theCUBE, with Peter Burris, head of research. Our next guest, Itamar Ankorion, who's the Chief Marketing Officer at Attunity. Welcome back to theCUBE, good to see you. >> Thank you very much. It's good to be back. >> We've been covering Attunity for many, many years. We've had many conversations; you guys have had great success in big data, so congratulations on that. But the world is changing, and we're seeing data integration, we've been calling this for multiple years, that's not going away, people need to integrate more. But with cloud, there's been a real focus on accelerating the scale component, with an emphasis on ease of use, data sovereignty, data governance; all these things are coming together, and the cloud has amplified them. What's going on in the big data world, and it's like, listen, get movin' or you're out of business, has pretty much been the mandate we've been seeing. A lot of people have been reacting. What's your response at Attunity these days, because you have successful piece parts with your product offering? What's the big update for you guys with respect to this big growth area? >> Thank you. First of all, the cloud data lakes have been a major force, changing the data landscape and data management landscape for enterprises. For the past few years, I've been working closely with some of the world's leading organizations across different industries as they deploy the first and then the second and third iteration of the data lake and big data architectures. And one of the things, of course, we're all seeing is the move to cloud, whether we're seeing enterprises move completely to the cloud, kind of move the data lakes, that's where they build them, or actually have a hybrid environment where part of the data lake and data analytics environment is on prem and part of it is in the cloud. The other thing we're seeing is that the enterprises are starting to mix more of the traditional data lake, where the cloud is the platform, and streaming technologies as the way to enable all the modern data analytics that they need, and that's what we have been focusing on: enabling them to use data across all these different technologies where and when they need it. >> So, the sum of the parts is worth more if it's integrated together seems to be the positioning, which is great, it's what customers want, make it easier. What is the hard news that you guys have, 'cause you have some big news? Let's get to the news real quick. >> Thank you very much. We did, today, we have announced, we're very excited about it, we have announced a new big release of our data integration platform. Our modern platform brings together Attunity Replicate, Attunity Compose for Hive, and Attunity Enterprise Manager, or AEM.
These are products that we've evolved significantly, invested a lot in over the last few years, to enable organizations to use data, make data available, and available in real time across all these different platforms, and then turn this data to be ready for analytics, especially in Hive and Hadoop environments, on prem and now also in the cloud. Today, we've announced a major release with a lot of enhancements across the entire product line. >> Some people might know you guys for the Replicate piece. I know that this announcement was 6.0, but as you guys have the other piece parts to this, really it's about modernization of kind of old-school techniques. That's really been the driver of your success. What specifically in this announcement makes it, you know, really work well for people who move in real time, they want to have good data access. What's the big aha for the customers out there with Attunity on this announcement? >> That's a great question, thank you. First of all is that we're bringing it all together. As you mentioned, over the past few years, Attunity Replicate has emerged as the choice of many Fortune 100 and other companies who are building modern architectures and moving data across different platforms, to the cloud, to their lakes, and they're doing it in a very efficient way. One of the things we've seen is that they needed the flexibility to adapt as they go through their journey, to adopt different platforms, and what we gave them with Replicate was the flexibility to do so. We give them the flexibility, we give them the performance to get the data, and the efficiency to move only the changes to the data as they happen, and to do that in a real-time fashion. Now, that's all great, but once the data gets to the data lake, how do you then turn it into valuable information? That's when we introduced Compose for Hive, which we talked about in our last session a few months ago, which basically takes the next stage in the pipeline: picking up incremental, continuous data that is fed into the data lake and turning it into operational data stores, historical data stores, data stores that are basically ready for analytics. What we've done with this release that we're really excited about is putting all of these together in a more integrated fashion, putting Attunity Enterprise Manager on top of it to help manage larger scale environments so customers can move faster in deploying these solutions. >> As you think about the role that Attunity's going to play over time, though, it's going to end up being part of a broader solution for how you handle your data. Imagine for a second the patterns that your customers are deploying. What is Attunity typically being deployed with? >> That's a great question. First of all, we're definitely part of a large ecosystem for building the new data architecture, the new data management, with data integration being more than ever a key part of that bigger ecosystem, because what they actually have today is more islands, with more places where the data needs to go, and to your point, more patterns in which the data moves. One of those patterns that we've seen significantly increase in demand and deployment is streaming. Where data used to be batch, now we're all talking about streaming. Kafka has emerged as a very common platform, but not only Kafka. If you're on Amazon Web Services, you're using Kinesis. If you're in Azure, you're using Azure Event Hubs. You have different streaming technologies. That's part of how this has evolved. >> How is that a challenge?
'Cause you just bring up a good point. I mean, the big trend is customers want the same code base on prem when they have the hybrid, which means the gateway, if you will, to the public cloud. They want to have the same code base, or move workloads between different clouds; multi-cloud seems to be the Holy Grail, we've identified it. We are taking the position that we think multi-cloud will be the preferred architecture going forward. Not necessarily this year, but it's going to get there. But as a customer, I don't want to have to retrain employees and redo skill development on Amazon, Azure, Google. I mean, each one has its own different path, you mentioned it. How do you talk to customers about that? Because they might be like, whoa, I want it, but how do I work in that environment? You guys have a solution for that? >> We do, and in fact, one of the things we've seen, to your point, we've seen the adoption of multiple clouds, and even if that adoption is staged, what we're seeing is more and more customers that are actually referring to the term lock-in with respect to the cloud. Do we put all the eggs in one cloud, or do we allow ourselves the flexibility to move around and use different clouds, and also mitigate our risk in that respect? What we've done from that perspective is, first of all, when you use the Attunity platform, we take away all the development complexity. The Attunity platform is very easy to set up. Your data flows are your data pipelines, and it's all common and consistent. Whether you're working on prem, whether you work on Amazon Web Services, on Azure, or on Google or other platforms, it all looks and feels the same. So first of all you solve the issue of the diversity, but also the complexity, because what we've done, and this is one of the big things that Attunity is focused on, was reducing the complexity, allowing you to configure these data pipelines without development efforts and resources. >> One of the challenges, or one of the things you typically do to take complexity out, is you do a better job of design up front. And I know that Attunity's got a tool set that starts to address some of these things. Take us a little bit through how your customers are starting to think in terms of designing flows as opposed to just cobbling together things in a bespoke way. How is that starting to change as customers gain experience with large data sets, the need to aggregate them, the ability to present them to developers in different ways? >> That's a great point, and again, one of the things we've focused on is to make the process of developing or configuring these different data flows easy and modular. First, in Attunity you can set up different flows in different patterns, and you can then make them available to others for consumption. Some create the data ingestion, others create the data ingestion and then a data transformation with Compose for Hive, and with Attunity Enterprise Manager, we've now also introduced APIs that allow you to create your own microservices, consuming and using the services enabled by the platform, so we provide more flexibility to put all these different solutions together. >> What's the biggest thing that you see from a customer standpoint, from a problem that you solve? If you had to kind of lay it out, you know the classic, hey, what problem do you solve?
'Cause there are many, so take us through the key problem, and then any secondary issues that you guys can address for customers; that seems to be the way the conversation starts. What are the key problems that you solve? >> I think one of the major problems that we solve is scale. Our customers that are deploying data lakes are trying to deploy and use data that is coming not from five or 10 or even 50 data sources; we work at hundreds going on thousands of data sources now. That in itself represents a major challenge to our customers, and we're addressing it by dramatically simplifying and making the process of setting those up very repeatable, very easy, and then providing the management facility, because when you have hundreds or thousands, management becomes a bigger issue to operationalize it. We invested a lot in a management facility for those, from a monitoring, control, and security standpoint. How do you secure it? The data lake is used by many different groups, so how do we allow each group to see and work only on what belongs to that group? That's part of it, too. So again, the scale is the major thing there. The other one is real timeliness. We talked about the move to streaming, and a lot of it is in order to enable streaming analytics, real-time analytics. That's only as good as your data, so you need to capture data in real time. And that of course has been our claim to fame for a long time, being the leading independent provider of CDC, change data capture technology. What we've done now, and also expanded significantly with the new release, version six, is creating universal database streaming. >> What is that? >> We take databases, all the enterprise databases, and we turn them into live streams. When you think, by the way, of the most common way that customers have used to bring data into the lake from a database, it was Sqoop. And Sqoop is a great, easy tool to use from an open source perspective, but it's scripting and batch. So you're building your new modern architecture with what is effectively scripting and batch. What we do with CDC is we enable you to take a database, and instead of the database being something you come to periodically to read, we actually turn it into a live feed, so as the data changes in the database, we stream it, we make it available across all these different platforms. >> Changes the definition of what live streaming is. We're live streaming theCUBE, we're data. We're data streaming, and you get great data. So, here's the question for you. This is a good topic, I love this topic. Pete and I talk about this all the time, and it's been addressed in the big data world, but you can see the pattern going mainstream globally, geopolitically and also in society. Batch processing versus data in motion, real time. Streaming brings up this use case to the end customer, which is, this is the way they've done it before: certainly store things in data lakes, that's not going to go away, you're going to store stuff, but the real gain is in motion. >> Itamar: Correct. >> How do you describe that to a customer when you go out and say, hey, you know, you've been living in a batch world, but wake up to the real world called real time. How do you get them to align with it? Some people get it right away, I see that, some people don't. How do you talk about that? Because that seems to be a real cultural thing going on right now, or operational readiness from the customer standpoint?
Can you just talk through your feeling on that? >> First of all, this often gets lost in translation, and we see quite a few companies and even IT departments where, when their business tells them we need real time, what they understand from it is: when you ask for the data, the response will be immediate. You get real time access to the data, but the data is from last week. So, we get real time access, but to last week's data. And what we try to do is basically say, wait a second, when you say real time, what does real time mean? And we start to understand what the meaning of using last week's data, or yesterday's data, versus the real time data is, and that makes a big difference. We actually see that today the access, the availability, the ability to act on the real time data, that's the frontier of competitive differentiation. That's what makes a customer experience better, that's what makes the business more operationally efficient than the competition. >> It's the data, not so much the process of what they used to do. Their version of real time is, I responded to you pretty quickly. >> Exactly. The other thing that's interesting is, again, we see change data capture becoming a critical component of the modern data architecture. Traditionally, we used to talk about different types of tools and technology; now CDC itself is becoming a critical part of it, and the reason is that it serves and answers a lot of fundamental needs that are now becoming critical. One is the need for real-time data. The other one is efficiency. If you're moving to the cloud, and we talked about this earlier, if your data lake is going to be in the cloud, there's no way you're going to reload all your data, because the bandwidth is going to get in the way. So you have to move only the delta. You need the ability to capture and move only the delta, so CDC becomes fundamental both in enabling the real time as well as the efficient, low-impact data integration. >> You guys have a lot of partners, technology partners, global SIs, resellers, a bunch of different partnership levels. The question I have for you, love to get your reaction and share your insight into, is, okay, as the relationship to the customer who has the problem, what's in it for me? I want to move my business forward, I want to do digital business, I need to get at my real-time data as it's happening. Whether it's near real time or real time, that's evolution, but ultimately, they have to move their developers down a certain path. They'll usually hire a partner. The relationship between partners and you, the supplier to the customer, has changed recently. >> That's correct. >> How is that evolving? >> First of all, it's evolving in several ways. We've invested on our part to make sure that we're building Attunity as a leading vendor in the ecosystem of system integration and consulting companies. We work with pretty much all the major global system integrators as well as regional ones, boutique ones that focus on the emerging technologies as well as the modern analytic-type platforms. We work a lot with plenty of them on major corporate data center-level migrations to the cloud. So again, the motivations are different, but we invest-- >> More specialized, are you seeing more specialty, what's the trend? >> We've been a technology partner of choice to both Amazon and Microsoft for enabling, facilitating the data migration to the cloud.
They of course have their select or preferred groups of partners they work with, so we all come together to create these solutions. >> Itamar, what are the goals for Attunity as we wrap up here? I give you the last word, as you guys have this big announcement, you're bringing it all together. Integration is key, it's always been the ethos of the company. Where is this next level, what's the next milestone for you guys? What do you guys see going forward? >> First of all, we're going to continue to modernize. We're really excited about the new announcement we did today: Replicate six, AEM six, a new version of Compose for Hive that now also supports more data lakes, Hortonworks, Cloudera, EMR. And a key point for us was expanding AEM to also enable analytics on the data we generate as data flows through it. The whole point is modernizing data integration, providing more intelligence in the process, reducing the complexity, and facilitating the automation end-to-end. We're going to continue to solve-- >> Automation, big, big time. >> Automation is a big thing for us, and the point is, you need to scale. In order to scale, we want to generate things for you so you don't have to develop every piece. We automate the automation, okay. The whole point is to deliver the solution faster, and the way we're going to do it is to continue to enhance each one of the products in its own space, whether it's replication across systems, Compose for Hive for transformations and pipeline automation, and AEM for management, but also to create integration between them. Again, for us it's to create a platform where our customers get more than the sum of the parts; they get the unique capabilities that we bring together in this platform. >> Itamar, thanks for coming onto theCUBE, appreciate it, congratulations to Attunity. And you guys bringing it all together, congratulations. >> Thank you very much. >> This is theCUBE live coverage, bringing it down here to New York City, Manhattan. I'm John Furrier, with Peter Burris. Be right back with more after this short break. (upbeat electronic music)
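As a rough illustration of the "database as a live feed" idea Ankorion describes above, here is a minimal sketch of consuming change-data-capture events from a Kafka topic in Python. It is a generic consumer, not Attunity Replicate's actual interface; the topic name, broker address, and event fields are all hypothetical.

    import json
    from kafka import KafkaConsumer  # assumes the kafka-python package

    # Subscribe to a topic carrying row-level change events (hypothetical names).
    consumer = KafkaConsumer(
        "orders.cdc",
        bootstrap_servers=["broker1:9092"],
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    # Each message describes one change as it happened in the source database.
    for msg in consumer:
        change = msg.value
        op = change.get("op")        # e.g. "insert", "update", "delete"
        row = change.get("data", {})
        print(f"row {op}: {row}")

Because only the deltas flow over the wire, the same feed can fan out to a data lake, a cloud target, or a streaming analytics engine without re-extracting the source tables, which is the efficiency argument made above.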

Published Date : Sep 27 2017


Tim Smith, AppNexus | BigData NYC 2017


 

>> Announcer: Live, from Midtown Manhattan, it's theCUBE. Covering Big Data, New York City, 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay, welcome back, everyone. Live in Manhattan, New York City, in Hell's Kitchen, this is theCUBE's special event, our annual CUBE-Wikibon Research Big Data event in Manhattan. Alongside Strata Hadoop, formerly Hadoop World, now called Strata Data, as the world continues. This is our annual event; it's our fifth year here, sixth overall, wanted to kind of move from uptown. I'm John Furrier, the co-host of theCUBE, with Peter Burris, Head of Research at SiliconANGLE and GM of Wikibon Research. Our next guest is Tim Smith, who's the SVP of technical operations at AppNexus; technical operations for large scale is an understatement. But before we get going; Tim, just talk about what AppNexus is as a company, what you guys do, what's the core business? >> Sure, AppNexus is the second largest digital advertising marketplace after Google. We're an internet technology company that harnesses data and machine learning to power the companies that comprise the open internet. We began by building a powerful technology platform, in which we embedded core capabilities, tools and features. With me so far? >> Yeah, we got it. >> Okay, on top of that platform, we built a core suite of cloud-based enterprise products that enable the buying and selling of digital advertising, and a scaled, transparent and low-cost marketplace where other companies can transact, either using our enterprise products or those offered by other companies. If you want to hear a little about the daily peaks, peak feeds and speeds, it is Strata, we should probably talk about that. We do about 11.8 billion impressions transacted on a daily basis. Each of those is a real-time auction conducted in a fraction of a second, well under half a second. We see about 225 billion impressions per day, and we handle about 5 million queries per second at peak load. We produce about 150 terabytes of data each day, and we move about 400 gigabits into and out of the internet at peak; all those numbers are daily peaks. Makes sense? >> Yep. >> Okay, so by way of comparison, which might be useful for people, I believe the NYSE currently does roughly 2 million trades per day. So if we round that up to 3 million trades a day and assume the NYSE were to conduct that volume every single day of the year, 7 days a week, 365 days a year, that'd be about a billion trades a year. Similarly, I believe Visa did about 28-and-a-half billion transactions in their fiscal third quarter. I'll round that up to 30 billion, average it out to about 333 million transactions per day, and annualize it to about 4 billion transactions per year. A little bit of math, but as I mentioned, AppNexus does in excess of 10 billion transactions per day. And so it seems reasonable to say that AppNexus does roughly 10 times the transaction volume in one day that the NYSE does in a year. And similarly, it seems reasonable to say that AppNexus daily does more than two times the transaction volume that Visa does in a year. Obviously, these are all just very rough numbers based on publicly available information about the NYSE and Visa, and both the NYSE and Visa do far, far more volume than AppNexus when measured in terms of dollars.
So given our volumes, it's imperative that AppNexus does each transaction with maximum efficiency and the lowest reasonable possible cost, and that is one of the most challenging aspects of my job. >> So thanks for spending the time to give the overview. There's a lot of data; I mean, 10 billion a day is massive volume. I mean the internet, and you see the scale, is insane. We're in a new era right now of web-scale. We've seen it with Facebook, and it's enormous. It's only going to get bigger, right? So on the online ad tech, you guys are essentially doing like a Google model; that's not everything Google does, but it's still huge numbers. Then you include Microsoft and everybody else. Really heavy lifting, IT-like situation. What's the environment like? And just talk about, you know, what it's like for you guys. Because you've got a lot of ops, I mean in terms of DevOps. You can't break anything, because at 10 billion transactions, or near it, there's a significant impact. So you have to have everything buttoned-up super tight, yet you've got to innovate and grow with the future growth. What's the IT environment like? >> It's interesting. We have about 8,000 servers spread across about seven data centers on three continents, and we run, as you mentioned, around the clock. There's no closing bell; downtime is not acceptable. So when you look at our environment, you're talking about four major categories of server complexes. We have real-time processing, which is the actual ad serving. We have a data pipeline, which is what we call our big data environment. We also have a client-facing environment and an infrastructure environment. So we use a lot of different tools and applications, but I think the most relevant ones to this discussion are Hadoop and its friends HDFS, Hive and Spark. And then we use the Vertica Analytics Platform. And together Hadoop and its friends, and Vertica, comprise our entire data pipeline. They're both very disk-intensive. They're cluster based applications, and it's quite a challenge to keep them up and running. >> So what are some of those challenges? Just explain a little bit, because you also have a lot of opportunity. I mean, it's money flowing through the air, basically; digital air, if you will. I mean, there's a lot of stuff happening. Take us through the challenges. >> You know, our biggest apps are all clustered. And all of our clusters are built with commodity servers, just like a lot of other environments. The big data app clusters traditionally have had internal disks, while almost all of our other servers are very light on disk. One of the biggest challenges is, since the server is the fundamental building block of a cluster, then regardless of whether you need more compute or more storage, you always have to add more servers to get it. That really limits flexibility and creates a lot of inefficiencies, and I really, really am obsessive about reducing and eliminating inefficiencies. So, with me so far? >> Yep. >> Great. The inefficiencies result from two major factors. First, not all workloads require the same ratio of compute to storage. Some workloads are more compute-intensive and less dependent on storage, while other workloads require a lot more storage. So we have to use standard server configurations, and as a result, we wind up with underutilized compute and storage. This is undesirable, it's inefficient, yet given our scale, we have to use standardized configurations. So that's the first big challenge.
The second is the compute to disk ratio. It's generally fixed when you buy the servers. Yes, we can certainly add more disks in the field, but that's labor intensive, and it's complicated from a logistics and an asset management standpoint, and you're fundamentally limited by the number of disk slots in the server. So now you're right back into the trap of more storage requires more servers, regardless of whether you need more compute or not. And then you compound the inefficiencies. >> Couldn't you just move the resources, unused resources, from one cluster to the other? >> I've been asked that a lot; and no, it's just not that simple. Each application cluster becomes a silo due to its configuration of storage and compute. This means you can't just move servers between clusters, because the clusters are optimized for the workloads, and the fact that you can't move resources from one cluster to another is more inefficiency. And then they're compounded over time, since workloads change and the ideal ratio of compute-to-storage changes. And the end result is unused resources trapped in silos, and configurations that are no longer optimized for your workload. And there's really only one solution that we've been able to find. And to paraphrase an orator far, far more talented than I am, namely Ronald Reagan: we need to open this gate, tear down these silos. The silos just have to go away. They fundamentally limit flexibility and efficiency. >> What were some of the other issues caused by using servers with internal drives? >> You have more maintenance, you've got to deal with the logistics. But the biggest problem is servers and storage have significantly different life cycles. Servers typically have a three year life cycle before they're obsolete. Storage is typically four to six years. You can sometimes stretch that a little further with the storage. With storage inside servers that are replaced every three years, we end up replacing storage before the end of its effective lifetime; that's inefficient. Further, since the storage is inside the servers, we have to do massive data migrations when we replace servers. Migrations are time consuming, logistically difficult, and high risk. >> So how did DriveScale help you guys? Because you guys certainly have a challenging environment; you laid out the story, and we appreciate that. How did DriveScale help you with the challenges? >> Well, what we really wanted to do was disaggregate storage from servers, and DriveScale enables us to do that. Disaggregating resources is a new term in the industry, but I think a lot of people are focusing on it. I can explain it if you think that would make sense. >> What do you mean by disaggregating resources? Can you explain that, and how it works? >> Sure, so instead of buying servers with internal drives, we now buy diskless servers with JBODs. And DriveScale lets us easily compose servers with whatever amount of disk storage we need, from the server resource pool and the disk resource pool; and they're separate pools. This means we have the right balance of compute and storage for each workload, and we can easily adjust it over time. And all of this is done via software, so it's easy to do with a GUI or, in our case at our scale, scripting. And it's done on demand, and it's much more efficient. >> How does it help you with the underutilized resource challenge you mentioned earlier?
>> Well, since we can add and remove resources from each cluster, we can manage exactly how much compute power and storage is deployed for each workload. Since this is all done via software, it can be done quickly and easily. We don't have to send a technician into a data center to physically swap drives, add drives, move drives. It's all done via software, and it's very, very efficient. >> Can you move resources between silos? >> Well, yes and no. First off, our goal is no more silos. That said, we still have clusters, and once we completely migrate to DriveScale, all of our compute and storage resources will be consolidated into just a few common pools. And disk storage will no longer differentiate pools; thus, we have fewer pools. What's more, with fewer pools we can use the resources in each pool for more workloads. And when our needs change, and they always do, we can reallocate resources as needed. >> What about the life cycle management challenge? How do you guys address that? >> Well, that's addressed with DriveScale. The compute and the storage are now disaggregated, or separated, into diskless servers and JBODs, so we can upgrade one without touching the other. When we want to upgrade servers to take advantage of new processors or new memory architectures, we just replace the servers, re-combine the disks with the new servers, and we're back up and operating. It saves the cost of buying new disks when we don't need to, and it also simplifies logistics and reduces risk, as we no longer have to run the old plant and the new plant concurrently and do a complicated data migration. >> What about qualifying server and storage vendors? Do you still do that? Or how does that impact-- >> We actually don't have to do it. We're still using the same server vendor. We've used Dell for many, many years, and we continue to use them. We are using them for storage, and there was no real work; we just had to add DriveScale into the mix. >> What's it like working with DriveScale? >> They're really wonderful to work with. They have a really seasoned team. They were at Sun Microsystems and Cisco; they built some of the really foundational products that the internet was built on. They're really talented, they're really bright, and they're really focused on customer success. >> Great story, thanks for sharing that. My final question for you is, you guys have a very big, awesome environment, you've got a lot of scale there. It's great for a startup to get into an environment like this, because one, they could get access to the data, work with a good team like you have. What's it like working with a startup? >> You know, it's always challenging at first; too many things to do. >> They've got talented guys. Most of those early day startups, they've got all their A players out there. >> They have their A players, and we've been very pleased working with them. We're dealing with some of the top talent in the industry, the people that created the industry. They have a proven track record. We really don't have any concerns; we know they're committed to our success, and they have a great team and great investors. >> A final, final question. For your friends out there watching, and other practitioners who are trying to run things at scale with a cloud, what's your advice to them? You've been operating at scale, and a lot of, billions of transactions, I mean huge; it's only going to get bigger. Put your IT friendly advice hat on.
What's the mindset of operators out there, technical ops, as DevOps comes in, seeing a lot of that. What do people need to be thinking about to run at scale? >> There's no magic silver bullet. There are no magic answers. The public cloud is very helpful in a lot of ways, but you really have to think hard about your economics, you have to think about your scale. You just have to be sure that you're going into each decision knowing that you've looked at the costs and the benefits, the performance, the risks, and you don't expect there to be simple answers. >> Yeah, there's no magic beans, as they say. You've got to make it work for the business. >> No magic beans, I wish there were. >> Tim, thanks so much for the story. Appreciate the commentary. Live coverage at Big Data NYC, it's theCUBE. Be back with more after this short break. (upbeat techno music)
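A back-of-the-envelope sketch of the fixed-ratio problem Smith describes: when compute and disk ship welded together in one server SKU, whichever resource a workload needs less of gets stranded. The Python below uses made-up node specs and workload sizes purely for illustration.

    # Hypothetical fixed SKU: 32 cores and 24 TB of disk per node.
    CORES_PER_NODE, TB_PER_NODE = 32, 24

    def nodes_needed(cores, tb):
        # Provision by whichever resource runs out first (ceiling division).
        return max(-(-cores // CORES_PER_NODE), -(-tb // TB_PER_NODE))

    # Two hypothetical workloads: compute-heavy versus storage-heavy.
    for name, cores, tb in [("compute-heavy", 640, 120), ("storage-heavy", 96, 960)]:
        n = nodes_needed(cores, tb)
        print(f"{name}: {n} nodes, "
              f"cores used {cores / (n * CORES_PER_NODE):.0%}, "
              f"disk used {tb / (n * TB_PER_NODE):.0%}")

With these numbers the compute-heavy cluster strands three quarters of its disk, and the storage-heavy cluster strands over 90% of its cores. Disaggregation sizes the compute pool and the disk pool independently, so both utilizations can approach 100% instead of being pinned to whatever ratio the SKU happened to ship with.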

Published Date : Sep 27 2017


Kickoff | AWS Summit 2017


 

>> Announcer: Live from Manhattan, it's theCUBE. Covering AWS Summit New York City 2017. Brought to you by Amazon Web Services. >> Hello and welcome to the Big Apple. AWS Summit kicking off here at the Javits Convention Center, New York, New York. Along with Stu Miniman, I'm John Walls; welcome to theCUBE as we continue our coverage here. Really I feel like this is ongoing, Stu, as far as what we're doing with AWS, (mumbles) public sector summit. We've covered AWS from the outside in for a very long time. So tell me what you make of this. I mean, regional show, we probably have four or 5,000 folks here, good turnout. What's the vibe you got, what's the feeling? >> It's really interesting, 'cause we've covered a few of the regional summits, but it's the first one that I've attended. I've actually already been starting to plan for AWS re:Invent, which is the big show in November. Expecting probably around 50,000 people at that show, but I think four years ago, four and a half years ago, when I went to the first (mumbles) summit in Las Vegas, it was about the size of what this show is. So Adrian Cockcroft got up on stage, said there were about 20,000 people registered. Of course registered doesn't mean that they're all here. A lot of people, I know, are watching the live stream, and it's free to attend, so if I'm in New York City, there's just a few people in New York that care about tech, probably. So maybe they'll pop in sometime today, but in the keynote there's definitely a few thousand people. It's a good sized expo hall here. This could be a five or 6,000 person event for the size of the expo hall that they have here, and the Javits Center can really hold some big activity. Impressive in scope, because Amazon and the cloud are still in early days. As Jeff (mumbles) says, there is no day two, we're always at day one and what's going on. Went through a lot of announcements, a lot of momentum, a lot of revenue in this big cloud thing.
>> Yeah and I don't mean to paint it like the wolf is at the door, by any means, but the competitors are at the door. So how much of that factors into this space (mumbles) you pointed everybody else has this huge market share. They're not even (mumbles) they're like the elephant and the gorilla in the room, but at the same time, you do, as you're coming on, Google's still out there looking. There's another player as well. >> Well if you talk to the Amazon people, they don't care about the competitors, they care about their customers. So they focus very much on what their customers are doing. They work on really small teams. If we want to talk about a couple of the announcements today, one of the ones that, at least the community I was watching, it's AWS glue, which really helps to get ETL, which is the extract, transform, and load really a lot of the heavy lifting and undifferentiated heavy lifting that data scientists are doing. Matt Wood, who was up on the keynote said 75% of their time is done on this kind of stuff, and here's something that can greatly reduce it. Few people in the Twitter stream were talking about they've used the beta of it. They're really excited. It was one that didn't sound all that exciting, but once you get into it it's like, oh wow, game changer. This is going to free up so much time. Really accelerate that speed of what I'm doing. Adrian Cockcroft talked about speed and flight freeing me from some of the early constraints. I'm an infrastructure guy by background and everything was like, and I've got that boat anchor stuff that I need to move along and the refresh cycles, and what do I have budget for today? And now I can spin things up so much faster. They give an example of, oh I'm going to do this on Hive and it's going to take me five years to do it as opposed to if I do it in the nice AWS service it takes 155 seconds. We've had lots of examples like this. One of the earliest customers I remember talking to over four years ago, Cycle Computing was like, we would build the super computer and it would have taken us two years and millions of dollars to build, and instead we did the entire project in two months and it cost us $10,000. So those are the kind of transformational things that we expect to hear from Amazon. Lots of customers, but getting into the nuance of it's a lot of building new service. Hulu got on stage and it wasn't that, they didn't say we've killed all of our data centers and everything that you do under Hulu is now under AWS. They said, we wanted to do live TV and live TV is very different from what we had built for in our infrastructure, and the streaming services that Amazon had, and the reach, and the CDN, and everything that they can do there makes it so that we could do this much faster and integrate what we were doing before with the live TV. Put those things together, transformational, expand their business model, and helps move forward Hulu so as they're not just a media company, they're a technology company and Amazon and Amazon support as a partner helps them with that transformation. >> So they're changing their mission obviously, and then technologically they have the help to do that. Part of the migration of AWS migration, we talked about that as well, one of those new services that they rolled out today. I think the quote was migration is a journey and we're going to make it a little simpler right now. >> Yeah we've been hearing for the last couple of years the database. 
So, you know, whether I've got Oracle databases, whether I was running SQL before, I want to migrate them, and with Amazon now, I have so many different migration tools, and this Migration Hub is now going to allow me to track all of my migrations across AWS. So this is not for the company that's saying, oh yeah, I'm tinkering with some stuff and I'm doing some test dev, but the enterprise that has thousands of applications or lots of locations and lots of people. They now need managers of managers to watch this, and some partners involved to help with a lot of these services. But really, sprawl is all of the services that Amazon has; every time they put up one of those eye charts with all of these different boxes, every one of them, when you dig in, it's like, oh, machine learning was a category before and now there are dozens of things inside it. You keep drilling down; I feel like it's that Christopher Nolan movie, Inception. We keep going levels deep to kind of figure it out. We need to move at cloud time, which is really fast, as opposed to kind of the old enterprise time. >> We hit on machine learning. We saw a lot of examples that cut across a pretty diverse set of brands and sectors, and really the democratization of machine learning, more or less. At least that was the takeaway I got from it. >> Absolutely. When you mention the competition, this is where Google has a strong position in machine learning. Amazon and Microsoft are also pushing there. So it is still early days in machine learning, and while Amazon has an undisputed lead in overall cloud, machine learning is one of those areas where everybody's starting from kind of the same starting point, and Amazon's brought in a lot of really good people. They've got a lot of people working on teams and building out new services. The one that was announced at the end of the keynote is Amazon Macie, which is really around my sensitive data, in a global context, using machine learning to understand when something's being used when it shouldn't be, and things like that. I was buying my family some subway tickets, and you could only buy two MetroCards with one credit card, because even if I put in all the data, it was like, no, we're only going to let you buy two, because if somebody got your credit card they could probably get that and do that. So that's the kind of thing where you're trying to act fast with data no matter where you are, because of malicious people and hackers; data is the new oil, as we said. It's something that we need to watch and be able to manage even better. So Amazon keeps adding tools and services to allow us to use our data, protect our data, and harness the value of data. I've really said data is the new flywheel for technology going forward. Amazon for years talked about the flywheel of customers: they add new services, more customers come on board, that drives new services, and now data is really that next flywheel that's going to drive that next bunch of years of innovation to come. >> You've talked a lot about announcements that we just heard about in the keynote. Big announcement fairly recently about the Cloud Native Computing Foundation. So all of a sudden they, I'd say, are not giving the Heisman, if you will, to Kubernetes, but maybe not embracing it, right? Fair enough to say. Different story now. All of a sudden they're platinum level, on the board. They have a voice in how Kubernetes is going to be rolled out going forward, or I guess maybe how Kubernetes is going to be working with AWS going forward.
>> You've talked a lot about the announcements we just heard in the keynote. There was also a big announcement fairly recently about the Cloud Native Computing Foundation. So all of a sudden they went from, I'd say, not quite giving Kubernetes the Heisman, but maybe not embracing it, fair enough to say, to a different story now. All of a sudden they're a platinum-level member on the board. They have a voice in how Kubernetes is going to be rolled out going forward, or I guess how Kubernetes is going to work with AWS going forward.

>> My comment on that, I gave a quote to SiliconANGLE; I'm on the analyst side, and the media side had written an article. I said it's a good step. I saw a great headline that was like, Amazon gives $350,000, so they're at least contributing on the financial piece, but when you dig in and read, there was a Medium blog post written by Adrian Cockcroft. He didn't touch on it at all in the keynote this morning, which I was a little surprised about, but what he said is, we're contributing, we're greatly involved, and there are all of these things happening in the CNCF. What Amazon has not said is, here is our service to make Kubernetes a first-class citizen. They have their container service, ECS, which doesn't use Kubernetes. Until this recent news, the option was to layer Kubernetes on top yourself, and there are a lot of offerings to do that. What I'd like to hear is what service Amazon is really going to offer there. My expectation, not knowing any concrete details, is that by the time we get to the big show in November they'll have that baked out more and probably have some announcements. I'm hoping at this show to talk to some people and really find out what's happening inside that Kubernetes piece, 'cause it helps with more than migrations. If I'm built with Kubernetes, it's built with containers, and containers are also the underlying component when I'm doing things like serverless with AWS Lambda. So if I can use Kubernetes, I can build one way and use multiple environments, whether that be public cloud or private clouds. So how much will Amazon embrace that? Will it just be, we're enabling Kubernetes, so if you've got a Kubernetes solution it's another migration path into Amazon, or will they open up a little bit more? We've really been watching to see how Amazon builds out its hybrid cloud offering, which is, how do they get into the customer's data center? We've seen the maturation from public cloud only, everything into the public cloud, to now Lambda starting to reach out a little bit with Greengrass; they've got their Snowballs; they've got the partnership with VMware, which we expect to hear lots more about at VMworld at the end of this month. They've got partnerships with Red Hat and a whole lot of other companies they're working with to really expand how they take all of these wonderful Amazon services in the public cloud, reach into the customer's data centers, and start leveraging those services there. All of those data services that keep getting added, lots of companies would want access to them.
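That build-one-way, run-in-multiple-environments point is easy to see in code: the official Kubernetes Python client speaks the same API to any conformant cluster, public cloud or on-premises. A minimal sketch, assuming a reachable cluster and a local kubeconfig:

```python
# A minimal sketch: the same client code lists pods on any conformant
# Kubernetes cluster, regardless of which cloud (or data center) runs it.
from kubernetes import client, config

config.load_kube_config()  # reads credentials from ~/.kube/config
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces(watch=False).items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")
```

Point the kubeconfig at a different cluster and the identical code keeps working, which is exactly the portability argument for AWS offering first-class Kubernetes support.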
>> Well, a full lineup of guests, as always, a great lineup of guests. But before we head out: you're with Wikibon, you do great analyst work there, and you've got that inquiring mind, you're a curious guy. What are you curious about today? What do you want to walk away from here tonight knowing a little more about?

>> As I mentioned, the whole Kubernetes story is absolutely one we want to hear about, and we're going to talk to a lot of the partners. We've seen a lot of the analytics and machine learning type solutions really getting to the public (mumbles) so it's good to get a pulse of this ecosystem, because while Amazon is, as we've said, not only the elephant in the room, Dave Alante, the chief analyst at Wikibon, calls them the cheetah: they move really fast, they're really nimble. Amazon's not always the easiest to partner with. How does the room feel, how are the customers, how are the partners, how much are they really all-in on AWS, and how many of them are multi-cloud, using Google for some of the data solutions and Microsoft for applications? Amazon loves to showcase people that are all-in. One of the speakers was Zocdoc, which lets me set appointments with doctors much faster using technology: rather than 24 days, it's 24 hours. They went from no AWS to fully 100% in on AWS in less than 12 months. Those are really impressive stories. Obviously that's a technology-centered company, but you see large companies too. FICO was the other one up on stage; we're actually hoping to have FICO on the program today. They're, what was it, over a 60-year-old company, so obviously they have a lot of legacy, and it's worth understanding how AWS fits into their environment. I actually interviewed someone from FICO a couple of years ago at an OpenStack show about their embrace of containers, and containers let them get into public cloud a little more easily. So I'd love to dig into those pieces: what's the pulse of the customers, what's the pulse of the partner ecosystem, and are there chinks in the armor? You mentioned the competitive piece. Usually when you come to an Amazon show, it's all Amazon all the time, and the number one gripe is usually pricing, but Amazon's made some moves. We did a bunch of interviews the week of the Google Next event about Google Cloud, and a lot of small and medium businesses said Google was priced better, Google has a clear advantage (mumbles) I'm moving away from Amazon. The week after the show, Amazon changed their pricing; we talked to some of the same people and they said, yeah, Amazon leveled the playing field. So Amazon listens and moves very fast. If they're not the first to create an offering, they'll spin something up quickly, and they can readjust their security and their pricing to make sure they're listening to their customers and meeting them, not necessarily in response to competitors, but around what the customers need, whenever customers are griping about something or there's a pain point they've had. Like we talked about, AWS Glue wasn't something a competitor had; it was a pain point where they saw a lot of time being spent, and they're looking to take that pain out. One of the lines that always gets quoted about Amazon is, your margin is our opportunity, and your pain as a customer is our opportunity too.

>> All right, a lot on the plate this day at AWS Summit. We'll be back with much more as we continue here on theCUBE at AWS Summit 2017 from New York City. (upbeat techno music)

Published Date : Aug 14 2017


SENTIMENT ANALYSIS :

ENTITIES

Entity                       Category        Confidence
Matt Wood                    PERSON          0.99+
Jeff                         PERSON          0.99+
Adrian Cockcroft             PERSON          0.99+
Amazon                       ORGANIZATION    0.99+
Dave Alante                  PERSON          0.99+
Microsoft                    ORGANIZATION    0.99+
John Walls                   PERSON          0.99+
New York                     LOCATION        0.99+
$350,000                     QUANTITY        0.99+
155 seconds                  QUANTITY        0.99+
Stu Miniman                  PERSON          0.99+
AWS                          ORGANIZATION    0.99+
New York City                LOCATION        0.99+
24 hours                     QUANTITY        0.99+
Andy Jassy                   PERSON          0.99+
Google                       ORGANIZATION    0.99+
Las Vegas                    LOCATION        0.99+
75%                          QUANTITY        0.99+
November                     DATE            0.99+
five years                   QUANTITY        0.99+
one credit card              QUANTITY        0.99+
five                         QUANTITY        0.99+
24 days                      QUANTITY        0.99+
two                          QUANTITY        0.99+
Christopher Nolan            PERSON          0.99+
Hulu                         ORGANIZATION    0.99+
Redhat                       ORGANIZATION    0.99+
Amazon Web Services          ORGANIZATION    0.99+
$10,000                      QUANTITY        0.99+
Stu                          PERSON          0.99+
two years                    QUANTITY        0.99+
Adrian                       PERSON          0.99+
tonight                      DATE            0.99+
today                        DATE            0.99+
VMware                       ORGANIZATION    0.99+
four years ago               DATE            0.99+
5,000 folks                  QUANTITY        0.99+
100%                         QUANTITY        0.99+
less than 12 months          QUANTITY        0.99+
Javits Convention Center     LOCATION        0.99+
16 billion dollar            QUANTITY        0.99+