Evolving InfluxDB into the Smart Data Platform
>> This past May, theCUBE, in collaboration with InfluxData, shared with you the latest innovations in time series databases. We talked at length about why, for many use cases, a purpose-built time series database is a superior alternative to general purpose databases trying to do the same thing. Now, you may remember that time series data is any data that's stamped in time, and if it's stamped, it can be analyzed historically. And when we introduced the concept to the community, we talked about how in theory those time slices could be taken every hour, every minute, every second, down to the millisecond, and how the world was moving toward real time or near real time data analysis to support physical infrastructure like sensors and other devices and IoT equipment. Time series databases have had to evolve to efficiently support real time data in emerging use cases in IoT and beyond. And to do that, new architectural innovations have had to be brought to bear. As is often the case, open source software is the linchpin of those innovations. Hello and welcome to Evolving InfluxDB into the Smart Data Platform, made possible by InfluxData and produced by theCUBE. My name is Dave Vellante, and I'll be your host today. Now, in this program, we're going to dig pretty deep into what's happening with time series data generally, and specifically how InfluxDB is evolving to support new workloads and demands, particularly around real time data analytics use cases. First we're going to hear from Brian Gilmore, who is the director of IoT and emerging technologies at InfluxData. We're going to talk about the continued evolution of InfluxDB and the new capabilities enabled by open source generally and by specific tools. In this program, you're going to hear a lot about things like the Rust implementation of Apache Arrow, the use of Parquet, and tooling such as DataFusion, which are powering a new engine for InfluxDB. These innovations evolve the idea of time series analysis by dramatically increasing the granularity of time series data, compressing the historical time slices, if you will, from, for example, minutes down to milliseconds, and at the same time enabling real time analytics with an architecture that can process data much faster and much more efficiently. After Brian, we're going to hear from Anais Dotis-Georgiou, who is a developer advocate at InfluxData. We're going to get into the "whys" of these open source capabilities and how they contribute to the evolution of the InfluxDB platform. And then we're going to close the program with Tim Yocum. He's the director of engineering at InfluxData, and he's going to explain how the InfluxDB community actually evolved the data engine in mid-flight and the decisions that went into the innovations that are coming to market. Thank you for being here. We hope you enjoy the program. Let's get started.
Brian Gilmore, InfluxData
(soft upbeat music) >> Okay, we're kicking things off with Brian Gilmore. He's the director of IoT and emerging technology at InfluxData. Brian, welcome to the program. Thanks for coming on. >> Thanks, Dave, great to be here. I appreciate the time. >> Hey, explain why InfluxDB, you know, needs a new engine. Was there something wrong with the current engine? What's going on there? >> No, no, not at all. I mean, I think, for us it's been about staying ahead of the market. I think, you know, if we think about what our customers are coming to us with now, you know, related to requests like SQL query support, things like that, we have to figure out a way to execute those for them in a way that will scale long term. And then we also want to make sure we're innovating, we're staying ahead of the market as well, and anticipating those future needs. So, you know, this is really a transparent change for our customers. I mean, I think we'll be adding new capabilities over time that leverage this new engine. But, you know, initially, the customers who are using us are going to see just great improvements in performance, you know, especially those that are working at the top end of the workload scale, you know, the massive data volumes and things like that. >> Yeah, and we're going to get into that today, and the architecture and the like. But what was the catalyst for the enhancements? I mean, when and how did this all come about? >> Well, I mean, like three years ago, we were primarily on premises, right? I mean, I think we had our open source, we had an enterprise product. And shifting that technology, especially the open source code base, to a service basis where we were hosting it through, you know, multiple cloud providers, that was a long journey. (chuckles) I guess, you know, phase one was, we wanted to host enterprise for our customers, so we sort of created a service where we just managed and ran our enterprise product for them. You know, phase two of this cloud effort was to optimize for like multi-tenant, multi-cloud, be able to host it in a truly, like, SaaS manner where we could use, you know, some type of customer activity or consumption as the pricing vector. And that was sort of the birth of the real first InfluxDB Cloud, you know, which has been really successful. We've seen, I think, like 60,000 people sign up. And we've got tons and tons of both enterprises as well as like new companies, developers, and of course a lot of home hobbyists and enthusiasts who are using it on a daily basis. And having that sort of big pool of very diverse and varied customers to chat with as they're using the product, as they're giving us feedback, et cetera, has, you know, pointed us in a really good direction in terms of making sure we're continuously improving that, and then also making these big leaps as we're doing with this new engine. >> All right, so you've called it a transparent change for customers, so I'm presuming it's non-disruptive, but I really want to understand how much of a pivot this is, and what does it take to make that shift from, you know, time series specialist to real time analytics and being able to support both? >> Yeah, I mean, it's much more of an evolution, I think, than like a shift or a pivot. Time series data is always going to be fundamental, sort of the basis of the solutions that we offer our customers, and also of the ones that they're building on the raw APIs of our platform themselves.
The time series market is one that we've worked diligently to lead, I mean, when it comes to metrics, especially sensor data and app and infrastructure metrics. If we're being honest though, I think our user base is well aware that the way we were architected was much more towards those sort of backwards-looking, historical-type analytics, which are key for troubleshooting and making sure you don't, you know, run into the same problem twice. But, you know, we had to ask ourselves, what can we do to better handle those queries from a performance and time-to-response standpoint, and can we get that to the point where the result sets are coming back so quickly from the time of query that we can limit that window down to minutes and then seconds? And now, with this new engine, we're really starting to talk about a query window that could be returning results in, you know, milliseconds of time since the data hit the ingest queue. And that's really getting to the point where, as your data is available, you can use it, you can query it, you can visualize it, you can do all those sort of magical things with it. And I think getting all of that to a place where we're saying yes to the customer on, you know, all of the real time queries, the multiple language query support, you know, it was hard, but we're now at a spot where we can start introducing that to, you know, a limited number of customers, strategic customers and strategic availability zones to start, but, you know, everybody over time. >> So you're basically going from what happened to, and you can still do that, obviously, but to what's happening now in the moment? >> Yeah. Yeah. I mean, if you think about time, it's always sort of past, right? I mean, like in the moment right now, whether you're talking about a millisecond ago or a minute ago, you know, that's pretty much right now, I think, for most people, especially in these use cases where you have other sort of components of latency induced by the underlying data collection, the architecture, the infrastructure, the devices, and, you know, the sort of highly distributed nature of all of this. So, yeah, I mean, getting a customer or a user to be able to use the data as soon as it is available is what we're after here. >> I always thought of real time as before you lose the customer, but now in this context, maybe it's before the machine blows up. >> Yeah, I mean, it is operationally, or operational real time, is different. And that's one of the things that really triggered us to know that we were heading in the right direction, is just how many sort of operational customers we have, you know, everything from like aerospace and defense. We've got companies monitoring satellites. We've got tons of industrial users using us as a process historian on the plant floor. And if we can satisfy their sort of demands for like a real time historical perspective, that's awesome. I think what we're going to do here is we're going to start to edge into the real time that they're used to in terms of, you know, the millisecond response times that they expect of their control systems, certainly not their historians and databases. >> Are these innovations available to InfluxDB cloud customers only? Who can access this capability? >> Yeah, I mean, commercially and today, yes. I think we want to emphasize that; for now, our goal is to get our latest and greatest and our best to everybody over time, of course.
You know, one of the things we had to do here was double down on sort of our commitment to open source and availability. So, like, anybody today can take a look at the libraries on our GitHub, inspect them, and even try to implement or execute some of it themselves in their own infrastructure. We are committed to bringing our latest and greatest to our cloud customers first, for a couple of reasons. Number one, you know, those are big workloads and they have high expectations of us. I think number two, it also gives us the opportunity to monitor a little bit more closely how it's working, how they're using it, like how the system itself is performing. And so just, you know, being careful, maybe a little cautious in terms of how big we go with this right away. That both limits, you know, the risk of any issues that can come with new software rollouts, we haven't seen anything so far, but it also gives us the opportunity to have meaningful conversations with a small group of users who are using the products. But once we get through that and they give us two thumbs up on it, it'll be like, open the gates and let everybody in. It's going to be an exciting time for the whole ecosystem. >> Yeah, that makes a lot of sense. And you can do some experimentation, you know, using the cloud resources. Let's dig into some of the architectural and technical innovations that are going to help deliver on this vision. What should we know there? >> Well, I mean, I think, foundationally, we built the new core on Rust. This is a newer, very popular systems language. It's extremely efficient, but it's also built for speed and memory safety, which goes back to us being able to deliver it in a way that is, you know, something we can inspect very closely, but also rely on the fact that it's going to behave well and handle error conditions if it does find them. I mean, we've loved working with Go, and a lot of our libraries will continue to be implemented in Go, but when it came to this particular new engine, the power, performance, and stability of Rust were critical. On top of that, we've also integrated Apache Arrow and Apache Parquet for persistence. I think, for anybody who's really familiar with the nuts and bolts of our backend and our TSI and our time series merge trees, this is a big break from that. You know, Arrow on the in-memory side and then Parquet on the on-disk side. It allows us to present, you know, a unified set of APIs for those really fast real time queries that we talked about, as well as for very large, you know, historical sort of bulk data archives in that Parquet format, which is also cool because there's an entire ecosystem popping up around Parquet in terms of the machine learning community. And to get that all to work, we had to glue it together with Arrow Flight. That's what we're using as our RPC component. It handles the orchestration and the transportation of the columnar data, now that we're moving to a true columnar database model for this version of the engine. You know, it removes a lot of overhead for us in terms of having to manage all that serialization and deserialization, and, you know, to that point again, blurring that line between real time and historical data, it's highly optimized both for streaming micro batches and batches, and for true streaming as well. >> Yeah, again, I mean, it's funny. You mentioned Rust.
It's been around for a long time, but its popularity is, you know, really starting to hit that steep part of the S-curve. And we're going to dig into more of that, but is there anything else that we should know about, Brian? Give us the last word. >> Well, I mean, I think first, I'd like everybody watching to just take a look at what we're offering in terms of early access and beta programs. I mean, if you want to participate, or if you want to work in terms of early access with the new engine, please reach out to the team. I'm sure, you know, there's a lot of communications going out and it'll be highly featured on our website. But reach out to the team. Believe it or not, like, we have a lot more going on than just the new engine. And so there are also other programs, things we're offering to customers in terms of the user interface, data collection, and things like that. And, you know, if you're a customer of ours and you have a sales team, a commercial team that you work with, you can reach out to them and see what you can get access to, because we can flip a lot of stuff on, especially in cloud, through feature flags. But if there's something new that you want to try out, we'd just love to hear from you. And then, you know, our goal would be that as we give you access to all of these new cool features, you would give us continuous feedback on these products and services, not only like what you need today, but then what you'll need tomorrow to build the next versions of your business. Because, you know, the whole database, the ecosystem, as it expands out into this vertically-oriented stack of cloud services, and enterprise databases, and edge databases, you know, it's going to be what we all make it together, not just those of us who are employed by InfluxData. And then finally, I would just say, please watch Anais' and Tim's sessions. Like, these are two of our best and brightest. They're totally brilliant, completely pragmatic, and they are most of all customer-obsessed, which is amazing. And there are no better takes, like, honestly, on the sort of technical details of this than theirs, especially when it comes to the value that these investments will bring to our customers and our communities. So, I encourage you to, you know, pay more attention to them than you did to me, for sure. >> Brian Gilmore, great stuff. Really appreciate your time. Thank you. >> Yeah, thanks David, it was awesome. Looking forward to it. >> Yeah, me too. I'm looking forward to seeing how the community actually applies these new innovations and goes beyond just the historical into the real time. Really hot area. As Brian said, in a moment, I'll be right back with Anais Dotis-Georgiou to dig into the critical aspects of key open source components of the InfluxDB engine, including Rust, Arrow, Parquet, and DataFusion. Keep it right there. You don't want to miss this. (soft upbeat music)
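To make the architecture Brian describes a bit more concrete, Arrow as the in-memory columnar representation and Parquet as the on-disk persistence format, with DataFusion and Arrow Flight layered above them, the same open source libraries can be exercised directly. The sketch below is illustrative only; the schema, file name, and filter are assumptions for the example and do not reflect InfluxDB's actual code paths.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# In-memory columnar table, standing in for recently ingested points.
# The schema here is purely illustrative, not InfluxDB's.
table = pa.table({
    "time": pa.array([1_000, 2_000, 3_000], type=pa.timestamp("ms")),
    "sensor": ["a", "a", "b"],
    "temp_c": [21.4, 21.9, 19.7],
})

# Persist the same columns to Parquet, the historical/bulk tier.
pq.write_table(table, "metrics.parquet")

# Read it back and run a columnar filter, the kind of operation a SQL
# layer such as DataFusion would plan over these files.
hist = pq.read_table("metrics.parquet")
only_a = hist.filter(pc.equal(hist["sensor"], "a"))
print(only_a.to_pydict())
```

Any Arrow-native engine, DataFusion among them, could in principle read the Parquet file written here, which is the interoperability point Brian makes about the ecosystem forming around Parquet.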
Data Power Panel V3
(upbeat music) >> The stampede to cloud and massive VC investments have led to the emergence of a new generation of object store based data lakes. And with them, two important trends, actually three important trends. First, a new category that combines data lakes and data warehouses, aka the lakehouse, has emerged as a leading contender to be the data platform of the future. And this novelty touts the ability to address data engineering, data science, and data warehouse workloads on a single shared data platform. The other major trend we've seen is that query engines and broader data fabric virtualization platforms have embraced NextGen data lakes as platforms for SQL centric business intelligence workloads, reducing, or some even claim eliminating, the need for separate data warehouses. Pretty bold. However, cloud data warehouses have added complementary technologies to bridge the gaps with lakehouses. And the third is that many, if not most, customers that are embracing the so-called data fabric or data mesh architectures are looking at data lakes as a fundamental component of their strategies, and they're trying to evolve them to be more capable, hence the interest in lakehouse, but at the same time, they don't want to, or can't, abandon their data warehouse estate. As such, we see a battle royale brewing between cloud data warehouses and cloud lakehouses. Is it possible to do it all with one cloud-centric analytical data platform? Well, we're going to find out. My name is Dave Vellante, and welcome to the data platforms power panel on theCUBE, our next episode in a series where we gather some of the industry's top analysts to talk about one of our favorite topics, data. In today's session, we'll discuss trends, emerging options, and the trade-offs of various approaches, and we'll name names. Joining us today are Sanjeev Mohan, who's the principal at SanjMo, Tony Baer, principal at dbInsight, and Doug Henschen, the vice president and principal analyst at Constellation Research. Guys, welcome back to theCUBE. Great to see you again. >> Thanks, guys. Thank you. >> Thank you. >> So it's early June and we're gearing up for two major conferences. There are several database conferences, but two in particular that we're very interested in, Snowflake Summit and Databricks Data and AI Summit. Doug, let's start off with you, and then Tony and Sanjeev, if you could kindly weigh in. Where did this all start, Doug? The notion of lakehouse. And let's talk about what exactly we mean by lakehouse. Go ahead. >> Yeah, well you nailed it in your intro. One platform to address BI, data science, data engineering; fewer platforms, less cost, less complexity, very compelling. You can credit Databricks for coining the term lakehouse back in 2020, but it's really a much older idea. You can go back to Cloudera introducing their Impala database in 2012. That was a database on top of Hadoop. And indeed, by the middle of that last decade, there were several SQL on Hadoop products and open standards like Apache Drill. And at the same time, the database vendors were trying to respond to this interest in machine learning and data science. So they were adding SQL extensions; the likes of Hudi and Vertica were adding SQL extensions to support the data science. But then later in that decade, with the shift to cloud and object storage, you saw the vendors shift to this whole cloud and object storage idea. So in the database camp you have Snowflake introducing Snowpark to try to address the data science needs.
They introduced that in 2020, and last year they announced support for Python. You also had Oracle and SAP jump on this lakehouse idea last year, supporting both the lake and the warehouse from a single vendor, though not necessarily quite a single platform. Google very recently also jumped on the bandwagon. And then, you also mentioned the SQL engine camp, the Dremios, the Ahanas, the Starbursts, really doing two things: a fabric for distributed access to many data sources, but also very firmly planting that idea that you can just have the lake and we'll help you do the BI workloads on it. And then of course, the data lake camp, with the Databricks and Clouderas providing warehouse-style deployments on top of their lake platforms. >> Okay, thanks, Doug. I'd be remiss; those of you who know me know that I typically write my own intros. This time my colleagues fed me a lot of that material, so thank you. You guys make it easy. But Tony, give us your thoughts on this intro. >> Right. Well, I very much agree with both of you, which may not make for the most exciting television, in that it has been an evolution, just like Doug said. I mean, for instance, just to give an example, when Teradata bought Aster Data, it was initially seen as a hardware platform play. In the end, it was basically all those Aster functions that made a lot of big data analytics accessible to SQL. (clears throat) And so what I really see, in a simpler or more functional definition, is that the data lakehouse is really an attempt by the data lake folks to make the data lake friendlier territory for the SQL folks, and also to make it friendly territory for all the data stewards, who are basically concerned about the sprawl and the lack of control and governance in the data lake. So it's really kind of a continuation of an ongoing trend. That being said, there's no action without counteraction. And of course, at the other end of the spectrum, we also see a lot of the data warehouses starting to add things like in-database machine learning. So they're certainly not surrendering without a fight. Again, as Doug was mentioning, this has been part of a continual blending of platforms that we've seen over the years, that we first saw in the Hadoop years with SQL on Hadoop and data warehouses starting to reach out to cloud storage, or I should say HDFS, and then with the cloud going cloud native and therefore trying to break the silos down even further. >> Now, thank you. And Sanjeev, data lakes, when we first heard about them, it was such a compelling name, and then we realized all the problems associated with them. So pick it up from there. What would you add to Doug and Tony? >> I would say these are excellent points that Doug and Tony have brought to light. The concept of the lakehouse was going on, to your point, Dave, a long time ago, long before the term was invented. For example, Uber was trying to do a mix of Hadoop and Vertica because what they really needed were transactional capabilities that Hadoop did not have. So they weren't calling it the lakehouse; they were using multiple technologies. But now they're able to collapse it into a single data store that we call the lakehouse. Data lakes are excellent at batch processing large volumes of data, but they don't have the real time capabilities such as change data capture, doing inserts and updates. So this is why the lakehouse has become so important, because it gives us these transactional capabilities. >> Great. So I'm interested; the name is great, lakehouse.
The concept is powerful, but I get concerned that there's a lot of marketing hype behind it. So I want to examine that a bit deeper. How mature is the concept of lakehouse? Are there practical examples that really exist in the real world that are driving business results for practitioners? Tony, maybe you could kick that off. >> Well, put it this way. I think what's interesting is that both data lakes and data warehouses have each had to extend themselves. To believe the Databricks hype, this was just a natural extension of the data lake. In point of fact, Databricks had to go outside its core technology of Spark to make the lakehouse possible. And it's a very similar type of thing on the part of the data warehouse folks, in that they've had to go beyond SQL. In the case of Databricks, there have been a number of incremental improvements to Delta Lake, to basically make the table format more performant, for instance. But the other thing, I think the most dramatic change in all that, is in their SQL engine, and they had to essentially abandon Spark SQL because, in and of itself, Spark SQL is essentially a stopgap solution. And if they wanted to really address that crowd, they had to totally reinvent SQL, or at least their SQL engine. And so Databricks SQL is not Spark SQL; it is not Spark. It's basically SQL that's adapted to run in a Spark environment, but the underlying engine is C++; it's not Scala or anything like that. So Databricks had to take a major detour outside of its core platform to do this. So to answer your question, this is not mature; even though the idea of blending platforms has been going on for well over a decade, I would say that the current iteration is still fairly immature. And in the cloud, I could see a further evolution of this, because if you think through cloud native architecture, where you're essentially abstracting compute from data, there is no reason why, if, let's say, you are dealing with the same data targets, say cloud object storage, you might not apportion the task to different compute engines. And so therefore you could have, for instance, let's say you're Google, you could have BigQuery perform the types of SQL analytics that would be associated with the data warehouse, and you could have BigQuery ML do some in-database machine learning, but at the same time, for another part of the query, which might involve, let's say, some deep learning, just for example, you might go out to, let's say, the serverless Spark service or Dataproc. And there's no reason why Google could not blend all those into a coherent offering that's basically all triggered through microservices. And I just gave Google as an example; you could generalize that to all the other cloud or third party vendors. So I think we're still very early in the game in terms of the maturity of data lakehouses. >> Thanks, Tony. So Sanjeev, is this all hype? What are your thoughts?
They are the ones who are going for BigQuery, Redshift, Snowflake, Synapse, and so on, because they want the platform to handle all the data modeling, access control, performance enhancements, but these are trade-offs. If you go with these platforms, then you are giving up on vendor neutrality. On the other side are those who have engineering skills. They want the independence. In other words, they don't want vendor lock-in. They want to transform their data into any number of use cases, especially data science and machine learning use cases. What they want is agility via open file formats, using any compute engine. So why do I say lakehouses are not mature? Well, cloud data warehouses provide you an excellent user experience. That is the main reason why Snowflake took off. If you have thousands of tables, it takes minutes to get them uploaded into your warehouse and start experimenting. Table formats resonate far more with the community than file formats. But once the cost of the cloud data warehouse goes up, then organizations start exploring lakehouses. But the problem is lakehouses still need to do a lot of work on metadata. Apache Hive was a fantastic first attempt at it. Even today Apache Hive is still very strong, but it's all technical metadata and it has so many different restrictions. That's why we see Databricks investing in something called Unity Catalog. Hopefully we'll hear more about Unity Catalog at the end of the month. But there's a second problem I just want to mention, and that is the lack of standards. All these open source vendors, they're running what I call ego projects. You see on LinkedIn, they're constantly battling with each other, but the end user doesn't care. The end user wants a problem to be solved. They want to use Trino, Dremio, Spark from EMR, Databricks, Ahana, DaaS, Flink, Athena. But the problem is that we don't have common standards. >> Right. Thanks. So Doug, I worry sometimes. I mean, I look at the space, we've debated for years best of breed versus the full suite. You see AWS with whatever, 12-plus different data stores and different APIs and primitives. You've got Oracle putting everything into its database. It's actually done some interesting things with MySQL HeatWave, so maybe there are proof points there. But Snowflake is really good at data warehouse, simplifying the data warehouse. Databricks is really good at making lakehouses actually more functional. Can one platform do it all? >> Well, in a word, you can't be best of breed at all things. I think that's the upshot of a cogent analysis from Sanjeev there. The vendors coming out of the database tradition excel at the SQL. They're extending it into data science, but when it comes to unstructured data, data science, ML and AI, it's often a compromise. The data lake crowd, the Databricks and such, have struggled to completely displace the data warehouse when it really gets to the tough SLAs; they acknowledge that there's still a role for the warehouse. Maybe you can size down the warehouse and offload some of the BI workloads, and maybe some of these SQL engines are good for ad hoc work and minimize data movement. But really, when you get to the deep service level requirements, the high concurrency, the high query workloads, you end up creating something that's warehouse-like. >> Where do you guys think this market is headed? What's going to take hold? Which projects are going to fade away? You've got some things in Apache projects like Hudi and Iceberg; where do they fit, Sanjeev?
Do you have any thoughts on that? >> So thank you, Dave. So I feel that table formats are starting to mature. There is a lot of work that's being done. We will not have a single product or single platform. We'll have a mixture. So I see a lot of Apache Iceberg in the news. Apache Iceberg is really innovating. Their focus is on a table format, but then Delta and Apache Hudi are doing a lot of deep engineering work. For example, how do you handle high concurrency when there are multiple writes going on? Do you version your Parquet files, or how do you do your upserts, basically? So, different focuses. At the end of the day, the end user will decide what is the right platform, but we are going to have multiple formats living with us for a long time. >> Doug, is Iceberg, in your view, something that's going to address some of those gaps in standards that Sanjeev was talking about earlier? >> Yeah, Delta Lake, Hudi, Iceberg, they all address this need for consistency and scalability. Delta Lake is open technically, but as for open access, I don't hear about Delta Lake anywhere but Databricks; I'm hearing a lot of buzz about Apache Iceberg. End users want an open performance standard. And most recently, Google embraced Iceberg for its recent BigLake announcement, their stab at supporting both lakes and warehouses on one conjoined platform. >> And Tony, of course, you remember the early days of the sort of big data movement: you had MapR as the most closed, you had Hortonworks as the most open, you had Cloudera in between. There was always this kind of contest as to who's the most open. Does that matter? Are we going to see a repeat of that here? >> I think it's spheres of influence, and Doug very much was kind of referring to this. I would call it kind of like the MongoDB syndrome, which is that you have... and I'm talking about MongoDB before they changed their license, an open source project, but very much associated with MongoDB, which basically controlled most of the contributions and made the decisions. And I think Databricks has the same ironclad hold on Delta Lake, but still, the market pretty much associates Delta Lake with Databricks as the open source project. I mean, Iceberg is probably further advanced than Hudi in terms of mind share. And so what I see that breaking down to is essentially the Databricks open source versus the everything-else open source, the community open source. So I see a very similar type of breakdown repeating itself here. >> So by the way, Mongo has a conference next week; another data platform is kind of not really relevant to this discussion, but in a sense it is, because there's been a lot of discussion on earnings calls these last couple of weeks about consumption and who's exposed, and obviously people are concerned about Snowflake's consumption model. Mongo is maybe less exposed because Atlas is prominent in the portfolio, blah, blah, blah. But I wanted to bring up the little bit of controversy that we saw come out of the Snowflake earnings call, where the Evercore analyst asked Frank Slootman about discretionary spend. And Frank basically said, look, we're not discretionary. We are deeply operationalized. Whereas he kind of poo-pooed the lakehouse or the data lake, et cetera, saying, oh yeah, data scientists will pull files out and play with them, that's really not our business. Do any of you have comments on that? Help us swing through that controversy. Who wants to take that one? >> Let's put it this way.
The SQL folks are from Venus and the data scientists are from Mars. So it really comes down to that type of perception. The fact is that, traditionally with analytics, it was very SQL oriented, and basically the quants were kind of off in their corner, where they were using SAS or where they were using Teradata. It's really a great leveler today, which is that basic Python has become arguably one of the most popular programming languages, depending on what month you're looking at the TIOBE index. And of course, obviously SQL is, as I tell the MongoDB folks, SQL is not going away. You have a large skills base out there. And so basically I see this breaking down to, essentially, you're going to have each group that's going to have its own natural preferences for its home turf. And the fact that, let's say, the Python and Scala folks are using Databricks does not make them any less operational or mission critical than the SQL folks. >> Anybody else want to chime in on that one? >> Yeah, I totally agree with that. Python support in Snowflake is very nascent; with all of Snowpark, all of the things outside of SQL, they're very much relying on partners to make things possible and make data science possible. And it's very early days. I think the bottom line, what we're going to see, is each of these camps is going to keep working on doing better at the thing that they don't do today, or that they're new to, but they're not going to nail it. They're not going to be best of breed on both sides. So the SQL-centric companies and shops are going to do more data science on their database-centric platform. The data science driven companies might be doing more BI on their lakes with those vendors, and the companies that have highly distributed data are going to add fabrics, and maybe offload more of their BI onto those engines, like Dremio and Starburst. >> So I've asked you this before, but I'll ask you, Sanjeev, 'cause Snowflake and Databricks are such great examples, 'cause you have the data engineering crowd trying to go into data warehousing and you have the data warehousing guys trying to go into the lake territory. Snowflake has $5 billion on the balance sheet, and I've asked you before and I ask you again, doesn't there have to be a semantic layer between these two worlds? Does Snowflake go out and do M&A and maybe buy an AtScale or a Datameer? Or is that just sort of a bandaid? What are your thoughts on that, Sanjeev? >> I think the semantic layer is the metadata. The business metadata is extremely important. At the end of the day, the business folks, they'd rather go to the business metadata than have to figure out, for example, let's say I want to update somebody's email address and we have a lot of overhead with data residency laws and all that. I want my platform to give me the business metadata so I can write my business logic without having to worry about which database, which location. So having that semantic layer is extremely important. In fact, now we are taking it to the next level. Now we are saying that it's not just a semantic layer, it's all my KPIs, all my calculations. So how can I make those calculations independent of the compute engine, independent of the BI tool, and make them fungible? So, more disaggregation of the stack, but it gives us more best of breed products that the customers have to worry about. >> So I want to ask you about the stack, the modern data stack, if you will.
And we always talk about injecting machine intelligence, AI, into applications, making them more data driven. But when you look at the application development stack, it's separate; the database tends to be separate from the data and analytics stack. Do those two worlds have to come together in the modern data world? And what does that look like organizationally? >> So organizationally, and even technically, I think it is starting to happen. Microservices architecture was a first attempt to bring the application and the data world together, but they are fundamentally different things. For example, if an application crashes, that's horrible, but Kubernetes will self-heal and it'll bring the application back up. But if a database crashes and corrupts your data, we have a huge problem. So that's why they have traditionally been two different stacks. They are starting to come together, especially with data ops, for instance, versioning of the way we write business logic. It used to be that business logic was highly embedded into our database of choice, but now we are disaggregating that using GitHub, CI/CD, the whole DevOps tool chain. So data is catching up to the way applications are. >> We also have those translytical databases; that's a little bit of what the story is with MongoDB next week, with adding more analytical capabilities. But I think companies that talk about that are always careful to couch it as operational analytics, not the warehouse-level workloads. So we're making progress, but I think there's always going to be, or there will long be, a separate analytical data platform. >> Until data mesh takes over. (all laughing) Not opening a can of worms. >> Well, but wait, I know it's out of scope here, but wouldn't data mesh say, hey, go take your best of breed? To Doug's earlier point, you can't be best of breed at everything. Wouldn't data mesh advocate: data lakes, do your data lake thing; data warehouse, do your data warehouse thing; then you're just a node on the mesh? (Tony laughs) Now you need separate data stores and you need separate teams. >> To my point. >> I think, I mean, put it this way. (laughs) Data mesh itself is a logical view of the world. The data mesh is not necessarily on the lake or on the warehouse. I think for me, the fear there is more in terms of the silos of governance that could happen and the siloed views of the world, how we redefine them. And that's why I want to go back to something Sanjeev said, which is that it's going to raise the importance of the semantic layer. Now, does Snowflake go there? That opens a couple of Pandora's boxes here. One, does Snowflake dare go into that space, or do they risk basically alienating their partner ecosystem, which is a key part of their whole appeal, which is best of breed? They're kind of in the same situation that Informatica was in in the early 2000s, when Informatica briefly flirted with analytic applications and realized that was not a good idea and needed to double down on their core, which was data integration. The other thing, though, that raises the importance of, and this is where the best of breed comes in, is the data fabric. My contention is that whether you employ data mesh practice or not, if you do employ data mesh, you need data fabric. If you deploy data fabric, you don't necessarily need to practice data mesh.
But with data fabric, and admittedly it's a category that's still very poorly defined and evolving, at its core we're talking about a common metadata backplane, something that we used to talk about with master data management. This would be something more, what I would say, mutable, something more evolving, basically using, let's say, machine learning so that we don't have to predefine rules or predefine what the world looks like. So I think in the long run, what this really means is that whichever way we implement, on whichever physical platform we implement, we need to all be speaking the same metadata language. And I think at the end of the day, regardless of whether it's a lake, a warehouse, or a lakehouse, we need common metadata. >> Doug, can I come back to something you pointed out? Those talking about bringing analytic and transaction databases together; you had talked about operationalizing those and the caution there. Educate me on MySQL HeatWave. I was surprised when Oracle put so much effort into that, and you may or may not be familiar with it, but a lot of folks have talked about that. Now it's got nowhere in the market, no market share, but we've seen a lot of these benchmarks from Oracle. How real is that bringing together of those two worlds and eliminating ETL? >> Yeah, I have to defer on that one. That's my colleague, Holger Mueller. He wrote the report on that. He's way deep on it and I'm not going to mock him. >> I wonder if that is something, how real that is or if it's just Oracle marketing; anybody have any thoughts on that? >> I'm pretty familiar with HeatWave. It's essentially Oracle doing, I mean, there's kind of a parallel with what Google's doing with AlloyDB. It's an operational database that will have some embedded analytics. And it's also something which I expect to start seeing with MongoDB. And I think basically Doug and Sanjeev were kind of referring to this before, about basically the operational analytics that are embedded within an operational database. The idea here is that the last thing you want to do with an operational database is slow it down. So you're not going to be doing very complex deep learning or anything like that, but you might be doing things like classification, you might be doing some predictives. In other words, we've just concluded a transaction with this customer, but was it less than what we were expecting? What does that mean in terms of, is this customer likely to churn? I think we're going to be seeing a lot of that. And I think that's a lot of what MySQL HeatWave is all about. Whether Oracle has any presence in the market now, it's still a pretty new announcement. But the other thing that kind of goes against Oracle, (laughs) that they had to battle against, is that even though they own MySQL and run the open source project, in terms of the actual commercial implementation, it's associated with everybody else. And the popular perception has been that MySQL has been basically kind of like a sidelight for Oracle. And so it's on Oracle's shoulders to prove that they're damn serious about it. >> There's no coincidence that MariaDB was launched the day that Oracle acquired Sun. Sanjeev, I wonder if we could come back to a topic that we discussed earlier, which is this notion of consumption; obviously Wall Street's very concerned about it. Snowflake dropped prices last week.
I've always felt like, hey, the consumption model is the right model. I can dial it down when I need to; of course, the street freaks out. What are your thoughts on just pricing, the consumption model? What's the right model for companies, for customers? >> The consumption model is here to stay. What I would like to see, and I think it's an ideal situation that actually plays into the lakehouse concept, is that I have my data in some open format, maybe it's Parquet or CSV or JSON or Avro, and I can bring whatever engine is the best engine for my workloads, bring it on, pay for consumption, and then shut it down. And by the way, that could be Cloudera. We don't talk about Cloudera very much, but it could be that one business unit wants to use Athena, another business unit wants to use something else, Trino let's say, or Dremio. So every business unit is working on the same data set, see, that's critical, but that data set is maybe in their VPC and they bring any compute engine, you pay for the use, shut it down. Then you're getting value and you're only paying for consumption. It's not like I left a cluster running by mistake, so there have to be guardrails. The reason FinOps is so big is because it's very easy for me to run a Cartesian join in the cloud and get a $10,000 bill. >> This looks like it's been a sort of a victim of its own success in some ways; they made it so easy to spin up single node instances, multi node instances. And back in the day when compute was scarce and costly, those database engines optimized every last bit so they could get as much workload as possible out of every instance. Today, it's really easy to spin up a new node, a new multi node cluster. So that freedom has meant many more nodes that aren't necessarily getting that utilization. So Snowflake has been doing a lot to add reporting, monitoring, dashboards around the utilization of all the nodes and multi node instances that have spun up. And meanwhile, we're seeing some of the traditional on-prem databases that are moving into the cloud trying to offer that freedom. And I think they're going to have that same discovery, that the cost surprises are going to follow as they make it easy to spin up new instances. >> Yeah, a lot of money went into this market over the last decade, separating compute from storage, moving to the cloud. I'm glad you mentioned Cloudera, Sanjeev, 'cause they got it all started, the kind of big data movement. We don't talk about them that much. Sometimes I wonder if it's because when they merged Hortonworks and Cloudera, they dead-ended both platforms, but then they did invest in a more modern platform. But what's the future of Cloudera? What are you seeing out there? >> Cloudera has a good product. I have to say, the problem in our space is that there are way too many companies, there's way too much noise. We are expecting the end users to parse it out, or we're expecting analyst firms to boil it down. So I think marketing becomes a big problem. As far as technology is concerned, I think Cloudera did turn themselves around, and Tony, I know you talk to them quite frequently. I think they've had quite a comprehensive offering for a long time actually. They created Kudu, so they've got operational, they have Hadoop, they have an operational data warehouse, they've migrated to the cloud. They are in a hybrid multi-cloud environment. A lot of cloud data warehouses are not hybrid. They're only in the cloud. >> Right.
I think where Cloudera has been most successful has been in the transition to the cloud and the fact that they're giving their customers more on-ramps to it, more hybrid on-ramps. So I give them a lot of credit there. They also have been trying to position themselves as being the most price friendly, in terms of, we will put more guardrails and governors on it. I mean, part of that could be spin. But on the other hand, they don't have the same vested interest in compute cycles as, say, AWS would have with EMR. That being said, yes, Cloudera does it; I think its most powerful appeal, and it almost sounds in a way, I don't want to cast them as a legacy system, but the fact is they do have a huge landed legacy on-prem and still significant potential to land and expand that to the cloud. That being said, even though Cloudera is multifunction, I think it certainly has its strengths and weaknesses. And the fact is that, yes, Cloudera has an operational database, or an operational data store, kind of like the outgrowth of HBase, but Cloudera is still primarily known for the deep analytics; as for the operational database, nobody's going to buy Cloudera or Cloudera Data Platform strictly for the operational database. They may use it as an add-on, just in the same way that a lot of customers have used, let's say, Teradata to do some machine learning, or, let's say, Snowflake to parse through JSON. Again, it's not an indictment or anything like that, but the fact is obviously they do have their strengths and their weaknesses. I think their greatest opportunity is with their existing base, because that base has a lot invested and vested. And the fact is they do have a hybrid path that a lot of the others lack. >> And of course being on the quarterly shot clock was not a good place to be under the microscope for Cloudera, and now they at least can refactor the business accordingly. I'm glad you mentioned hybrid too. We saw Snowflake last month did a deal with Dell whereby non-native Snowflake data could access on-prem object stores from Dell. They announced a similar thing with Pure Storage. What do you guys make of that? Is that just... How significant will that be? Will customers actually do that? I think they're using either materialized views or external tables. >> There are data-related and residency requirements. There are desires to have these platforms in your own data center. And finally they capitulated. I mean, Frank Slootman is famous for saying to be very focused, and earlier, not many months ago, they called going on-prem a distraction, but clearly there's enough demand, and certainly with government contracts and any company that has data residency requirements, it's a real need. So they finally addressed it. >> Yeah, I'll bet dollars to donuts there was an EBC session and some big customer said, if you don't do this, we ain't doing business with you. And that was like, okay, we'll do it. >> So Dave, I have to say, earlier on you had brought this point, how Frank Slootman was poo-pooing data science workloads. On your show, about a year or so ago, he said, we are never going to go on-prem. He burnt that bridge. (Tony laughs) That was on your show. >> I remember exactly the statement because it was interesting. He said, we're never going to do the halfway house. And I think what he meant is we're not going to bring the Snowflake architecture to run on-prem because it defeats the elasticity of the cloud. So this was kind of a capitulation in a way.
But I think it still preserves his original intent, sort of, I don't know. >> The point here is that every vendor will poo-poo whatever they don't have until they do have it. >> Yes. >> And then it'd be like, oh, we are all in, we've always been doing this. We have always supported this, and now we are doing it better than others. >> Look, it was the same type of shock wave that we felt basically when AWS, at the last moment at one of their re:Invents, said, oh, by the way, we're going to introduce Outposts. And the analyst group is typically pre-briefed about a week or two ahead under NDA, and that was not part of it. And when they dropped it, they just casually dropped that in the analyst session. It's like, you could have heard the sound of lots of analysts changing their diapers at that point. >> (laughs) I remember that. And props to Andy Jassy, who once, many times actually, told us, never say never when it comes to AWS. So guys, I know we got to run. We got some hard stops. Maybe you could each give us your final thoughts. Doug, start us off and then-- >> Sure. Well, we've got the Snowflake Summit coming up. I'll be looking for customers that are really doing data science, that are really employing Python through Snowflake, through Snowpark. And then a couple weeks later, we've got Databricks with their Data and AI Summit in San Francisco. I'll be looking for customers that are really doing considerable BI workloads. Last year I did a market overview of this analytical data platform space, 14 vendors; eight of them claim to support lakehouse, both sides of the camp. The top customer Databricks could cite was unnamed; it had 32 concurrent users doing 15,000 queries per hour. That's good, but it's not up to the most demanding BI SQL workloads. And they acknowledged that and said they need to keep working on that. Snowflake, asked for their biggest data science customer, cited Kabura: 400 terabytes, 8,500 users, 400,000 data engineering jobs per day. I took the data engineering jobs to be probably SQL-centric, ETL-style transformation work. So I want to see the real use of the Python, how much Snowpark has grown as a way to support data science. >> Great. Tony. >> Actually, of all things, and certainly I'll also be looking for similar things to what Doug is saying, but I think, sort of out of left field, I'm interested to see what MongoDB is going to start to say about operational analytics, 'cause I mean, they're into this conquer-the-world strategy: we can be all things to all people. Okay, if that's the case, what's it going to look like with, basically, putting in some inline analytics? What are you going to be doing with your query engine? So that's actually kind of an interesting thing we're looking for next week. >> Great. Sanjeev. >> So I'll be at MongoDB World, Snowflake, and Databricks, and very interested in seeing all of it. But since Tony brought up MongoDB, I see that even the databases are shifting tremendously. They are addressing the HTAP use case, online transactional and analytical. I'm also seeing that these databases started, let's say in the case of MySQL HeatWave, as relational, or in the case of MongoDB, as document, but now they've added graph, they've added time series, they've added geospatial, and they just keep adding more and more data structures, really making these databases multifunctional. So very interesting. >> It gets back to our discussion of best of breed versus all in one.
And it's likely Mongo's path or part of their strategy of course, is through developers. They're very developer focused. So we'll be looking for that. And guys, I'll be there as well. I'm hoping that we maybe have some extra time on theCUBE, so please stop by and we can maybe chat a little bit. Guys as always, fantastic. Thank you so much, Doug, Tony, Sanjeev, and let's do this again. >> It's been a pleasure. >> All right and thank you for watching. This is Dave Vellante for theCUBE and the excellent analyst. We'll see you next time. (upbeat music)
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Doug | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Tony | PERSON | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
Frank | PERSON | 0.99+ |
Frank Slootman | PERSON | 0.99+ |
Tony Baer | PERSON | 0.99+ |
Mars | LOCATION | 0.99+ |
Doug Henschen | PERSON | 0.99+ |
2020 | DATE | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Venus | LOCATION | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
2012 | DATE | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
Dell | ORGANIZATION | 0.99+ |
Hortonworks | ORGANIZATION | 0.99+ |
Holger Mueller | PERSON | 0.99+ |
Andy Jassy | PERSON | 0.99+ |
last year | DATE | 0.99+ |
$5 billion | QUANTITY | 0.99+ |
$10,000 | QUANTITY | 0.99+ |
14 vendors | QUANTITY | 0.99+ |
Last year | DATE | 0.99+ |
last week | DATE | 0.99+ |
San Francisco | LOCATION | 0.99+ |
SanjMo | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
8,500 users | QUANTITY | 0.99+ |
Sanjeev | PERSON | 0.99+ |
Informatica | ORGANIZATION | 0.99+ |
32 concurrent users | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
Constellation Research | ORGANIZATION | 0.99+ |
Mongo | ORGANIZATION | 0.99+ |
Sanjeev Mohan | PERSON | 0.99+ |
Ahana | ORGANIZATION | 0.99+ |
DaaS | ORGANIZATION | 0.99+ |
EMR | ORGANIZATION | 0.99+ |
32 | QUANTITY | 0.99+ |
Atlas | ORGANIZATION | 0.99+ |
Delta | ORGANIZATION | 0.99+ |
Snowflake | ORGANIZATION | 0.99+ |
Python | TITLE | 0.99+ |
each | QUANTITY | 0.99+ |
Athena | ORGANIZATION | 0.99+ |
next week | DATE | 0.99+ |
Mark Lyons, Dremio | AWS Startup Showcase S2 E2
(upbeat music) >> Hello, everyone and welcome to theCUBE presentation of the AWS startup showcase, data as code. This is season two, episode two of the ongoing series covering the exciting startups from the AWS ecosystem. Here we're talking about operationalizing the data lake. I'm your host, John Furrier, and my guest here is Mark Lyons, VP of product management at Dremio. Great to see you, Mark. Thanks for coming on. >> Hey John, nice to see you again. Thanks for having me. >> Yeah, we were talking before we came on camera here on this showcase we're going to spend the next 20 minutes talking about the new architectures of data lakes and how they expand and scale. But we kind of were reminiscing by the old big data days, and how this really changed. There's a lot of hangovers from (mumbles) kind of fall through, Cloud took over, now we're in a new era and the theme here is data as code. Really highlights that data is now in the developer cycles of operations. So infrastructure is code-led DevOps movement for Cloud programmable infrastructure. Now you got data as code, which is really accelerating DataOps, MLOps, DatabaseOps, and more developer focus. So this is a big part of it. You guys at Dremio have a Cloud platform, query engine and a data tier innovation. Take us through the positioning of Dremio right now. What's the current state of the offering? >> Yeah, sure, so happy to, and thanks for kind of introing into the space that we're headed. I think the world is changing, and databases are changing. So today, Dremio is a full database platform, data lakehouse platform on the Cloud. So we're all about keeping your data in open formats in your Cloud storage, but bringing that full functionality that you would want to access the data, as well as manage the data. All the functionality folks would be used to from NC SQL compatibility, inserts updates, deletes on that data, keeping that data in Parquet files in the iceberg table format, another level of abstraction so that people can access the data in a very efficient way. And going even further than that, what we announced with Dremio Arctic which is in public preview on our Cloud platform, is a full get like experience for the data. So just like you said, data as code, right? We went through waves and source code and infrastructure as code. And now we can treat the data as code, which is amazing. You can have development branches, you can have staging branches, ETL branches, which are separate from production. Developers can do experiments. You can make changes, you can test those changes before you merge back to production and let the consumers see that data. Lots of innovation on the platform, super fast velocity of delivery, and lots of customers adopting it in just in the first month here since we announced Dremio Cloud generally available where the adoption's been amazing. >> Yeah, and I think we're going to dig into the a lot of the architecture, but I want to highlight your point you made about the branching off and taking a branch of Git. This is what developers do, right? The developers use GitHub, Git, they bake branches from code. They build on top of other code. That's open source. This is what's been around for generations. Now for the first time we're seeing data sets being taken out of production to be worked on and coded and tested and even doing look backs or even forward looking analysis. This is data being programmed. This is data as code. This is really, you couldn't get any closer to data as code. >> Yeah. 
It's all done through metadata by the way. So there's no actual copying of these data sets 'cause in these big data systems, Cloud data lakes and stuff, and these tables are billions of records, trillions of records, super wide, hundreds of columns wide, thousands of columns wide. You have to do this all through metadata operations so you can control what version of the data basically a individual's working with and which version of the data the production systems are seeing because these data sets are too big. You don't want to be moving them. You can't be moving them. You can't be copying them. It's all metadata and manifest files and pointers to basically keep track of what's going on. >> I think this is the most important trend we've seen in a long time, because if you think about what Agile did for developers, okay, speed, DevOps, Cloud scale, now you've got agility in the data side of it where you're basically breaking down the old proprietary, old ways of doing data warehousing, but not killing the functionality of what data warehouses did. Just doing more volume data warehouses where proprietary, not open. They were different use cases. They were single application developers when used data warehouse query, not a lot of volume. But as you get volume, these things are inadequate. And now you've got the new open Agile. Is this Agile data engineering at play here? >> Yeah, I think it totally is. It's bringing it as far forward in as possible. We're talking about making the data engineering process easier and more productive for the data engineer, which ultimately makes the consumers of that data much happier as well as way more experiments can happen. Way more use cases can be tried. If it's not a burden and it doesn't require building a whole new pipeline and defining a schema and adding columns and data types and all this stuff, you can do a lot more with your data much faster. So it's really going to be super impactful to all these businesses out there trying to be data driven, especially when you're looking at data as a code and branching, a branch off, you can de-risk your changes. You're not worried about messing up the production system, messing up that data, having it seen by end user. Some businesses data is their business so that data would be going all the way to a consumer, a third party. And then it gets really scary. There's a lot of risk if you show the wrong credit score to a consumer or you do something like that. So it's really de-risking... >> Even updating machine learning algorithms. So for instance, if the data sets change, you can always be iterating on things like machine learning or learning algorithms. This is kind of new. This is awesome, right? >> I think it's going to change the world because this stuff was so painful to do. The data sets had gotten so much bigger as you know, but we were still doing it in the old way, which was typically moving data around for everyone. It was copying data down, sampling data, moving data, and now we're just basically saying, hey, don't do that anymore. We got to stop moving the data. It doesn't make any sense. >> So I got to ask you Mark, data lakes are growing in popularity. I was originally down on data lakes. I called them data swamps. I didn't think they were going to be as popular because at that time, distributed file systems like Hadoop, and object store in the Cloud were really cool. So what happened between that promise of distributed file systems and object store and data lakes? What made data lakes popular? 
What made that work in your opinion? >> Yeah, it really comes down to the metadata, which I already mentioned once. But we went through these waves, John. You saw we did the EDWs, then the data lakes, and then the Cloud data warehouses. I think we're at the start of a cycle back to the data lake. And it's because the data lakes this time around, with the Apache Iceberg table format, with project (mumbles) and what Dremio's working on around metadata, these things aren't going to become data swamps anymore. They're actually going to be functional systems that do inserts, updates and deletes. You can see all the commits. You can time travel them. And all the files are actually managed and optimized, so you have to partition the data, you have to merge small files into larger files. Oh, by the way, this is stuff that all the warehouses have done behind the scenes, all the housekeeping they do, but people weren't really aware of it. And the data lakes the first time around didn't solve all these problems, so those files landing in a distributed file system did become a mess. If you just land JSON, Avro or Parquet files, CSV files into HDFS or an S3-compatible object store, it doesn't matter, if you're just parking files and you're going to deal with it as schema on read instead of schema on write, you're going to have a mess. If you don't know which tool changed the files, which user deleted a file, updated a file, you will end up with a mess really quickly. So to take care of that, you have to put a table format on it, so everyone's looking at Apache Iceberg or the Databricks Delta format, which is an interesting conversation, similar to the Parquet and ORC file format story that we saw play out. And then you track the metadata. So you have those manifest files. You know which files changed when, which engine, which commit. And you can actually make a functional system that's not going to become a swamp. >> Another trend that's extending on beyond the data lake is other data sources, right? So you have a lot of other data, not just in data lakes, so you have to kind of work with that. How do you guys answer the question around some of the mission-critical BI dashboards out there on the latency side? A lot of people have been complaining that these mission-critical BI dashboards aren't getting the kind of performance as they add more data sources and they try to do more. >> Yeah, that's a great question. Dremio does actually a bunch of interesting things to bring the performance of these systems up, because at the end of the day, people want to access their data really quickly. They want the response times of these dashboards to be interactive. Otherwise the data's not interesting if it takes too long to get it. To answer your question, yeah, a couple of things. First of all, from a data source side, Dremio is very proficient with Parquet files in an object store, like we just talked about, but it also can access data in other relational systems. So whether that's a Postgres system, whether that's a Teradata system or an Oracle system. That's really useful if you have dimensional data, customer data, not the largest data set in the world, not the fastest-moving data set in the world, but you don't want to move it. We can query that where it resides. Bringing in new sources is definitely, we all know, a key to getting better insights. It's in your data, it's joining sources together. And then from a query speed standpoint, there's a lot of things going on here.
Everything from kind of the Apache Arrow project, which is the in-memory format alongside Parquet, so we're not serializing and de-serializing the data back and forth. As well as what we call reflections, which is basically a re-indexing or pre-computing of the data, but we leave it in Parquet format, in an open format, in the customer's account, so that you can have aggregates and other things that are really popular in these dashboards pre-computed. So millisecond response, lightning fast, like tricks that a warehouse would do, that the warehouses have been doing forever. Right? >> Yeah, more deals coming in. And obviously the architecture, we'll get into that, now has to handle the growth. And as your customers and practitioners see the volume and the variety and the velocity of the data coming in, how are they adjusting their data strategies to respond to this? Again, Cloud is clearly the answer, not the data warehouse, but what are they doing? What's the strategy adjustment? >> It's interesting when we start talking to folks. I think sometimes it's a really big shift in thinking about data architectures and data strategies when you look at the Dremio approach. It's very different than what most people are doing today around ETL pipelines, and then bringing stuff into a warehouse, and oh, the warehouse is too overloaded, so let's build some cubes and extracts into the next tier of tools to speed up those dashboards for those tools. And Dremio has totally flipped this on its head and said, no, let's not do all those things. That's time consuming. It's brittle, it breaks. And actually your agility and the scope of what you can do with your data decreases. You go from all your data and all your data sources to smaller and smaller. We actually call it the perimeter of doom, and a lot of people look at this and say, yeah, that kind of looks like how we're doing things today. So from a Dremio perspective, it's really about no copy, try to keep as much data in one place, keep it in one open format, and less data movement. And that's a very different approach for people. I think they don't realize how much you can accomplish that way. And your latency shrinks down too. Your actual latency from data created to insight is much shorter. And it's not because of the query response time, that latency is mostly because of data movement and copy and all these things. So you really want to shrink your time to insight. It's not about getting a faster query from a few seconds down, it's about changing the architecture. >> The data drift, as they say, interesting there. I got to ask you on the personnel side, team side. You got the technical side, you got the non-technical consumers of the data, you got data science or data engineering ramping up. We mentioned earlier data engineering being Agile is a key innovation here. As you got to blend the two personas of technical and non-technical people playing with data, coding with data, where are the bottlenecks in this process today? How can data teams overcome these bottlenecks? >> I think we see a lot of bottlenecks in the process today, a lot of data movement, a lot of change requests, update this dashboard. Oh, well, that dashboard update requires an ETL pipeline update, requires a column to be added to this warehouse. So then you've got these personas, like you said, some more technical, less technical, the data consumers, the data engineers. Well, the data engineers are getting totally overloaded with requests and work.
And it's not even super value-add work to the business. It's not really driving big changes in their culture and insights and new new use cases for data. It's turning through kind of small changes, but it's taking too much time. It's taking days, if not weeks for these organizations to manage small changes. And then the data consumers, the less technical folks, they can't get the answers that they want. They're waiting and waiting and waiting and they don't understand why things are so challenging, how things could take so much time. So from a Dremio perspective, it's amazing to watch these organizations unleash their data. Get the data engineers, their productivity up. Stop dealing with some of the last mile ETL and small changes to the data. And Dremio actually says, hey, data consumers, here's a really nice gooey. You don't need to be a SQL expert, well, the tool will write the joints for you. You can click on a column and say, hey, I want to calculate a new field and calculate that field. And it's all done virtually so it's not changing the physical data sets. The actual data engineering team doesn't even really need to care at that point. So you get happier data consumers at the end of the day. They're doing things more self-service. They're learning about the data and the data engineering teams can go do value-add things. They can re-architecture the platform for the future. They can do POCs to test out new technologies that could support new use cases and bring those into the organization. Things that really add value, instead of just churning through backlogs of, hey, can we get a column added or we change... Everyone's doing app development, AB testing, and those developers are king. Those pipelines stream all this data down when the JSON files change. You need agility. And if you don't have that agility, you just get this endless backlog that you never... >> This is data as code in action. You're committing data back into the main brand that's been tested. That's what developers do. So this is really kind of the next step function. I got to put the customer hat on for a second and ask you kind of the pessimist question. Okay, we've had data lakes, I've got data lakes, it's been data lakes around, I got query engines here and there, they're all over the place, what's missing? What's been missing from the architecture to fully realize the potential of a data lakehouse? >> Yeah, I think that's a great question. The customers say exactly that John. They say, "I've got 22 databases, you got to be kidding me. You showed up with another database." Or, hey, let's talk about a Cloud data lake or a data lake. Again, I did the data lake thing. I had a data lake and it wasn't everything I thought it was going to be. >> It was bad. It was data swamp. >> Yeah, so customers really think this way, and you say, well, what's different this time around? Well, the Cloud in the original data lake world, and I'm just going to focus on data lakes, so the original data lake worlds, everything was still direct attached storage, so you had to scale your storage and compute out together. And we built these huge systems. Thousands of thousands of HDFS nodes and stuff. Well, the Cloud brought the separated compute and storage, but data lakes have never seen separated compute and storage until now. We went from the data lake with directed tap storage to the Cloud data warehouse with separated compute and storage. 
So the Cloud architecture and getting compute and storage separated is a huge shift in the data lake world. And that agility of like, well, I'm only going to apply it, the compute that I need for this question, for this answer right now, and not get 5,000 servers of compute sitting around at some peak moment. Or just 5,000 compute servers because I have five petabytes or 50 petabytes of data that need to be stored in the discs that are attached to them. So I think the Cloud architecture and separating compute and storage is the first thing that's different this time around about data lakes. But then more importantly than that is the metadata tier. Is the data tier and having sufficient metadata to have the functionality that people need on the data lake. Whether that's for governance and compliance standpoints, to actually be able to do a delete on your data lake, or that's for productivity and treating that data as code, like we're talking about today, and being able to time travel it, version it, branch it. And now these data lakes, the data lakes back in the original days were getting to 50 petabytes. Now think about how big these Cloud data lakes could be. Even larger and you can't move that data around so we have to be really intelligent and really smart about the data operations and versioning all that data, knowing which engine touch the data, which person was the last commit and being able to track all that, is ultimately what's going to make this successful. Because if you don't have the governance in place these days with data, the projects are going to fail. >> Yeah, and I think separating the query layer or SQL layer and the data tier is another innovation that you guys have. Also it's a managed Cloud service, Dremio Cloud now. And you got the open source angle too, which is also going to open up more standardization around some of these awesome features like you mentioned the joints, and I think you guys built on top of Parquet and some other cool things. And you got a community developing, so you get the Cloud and community kind of coming together. So it's the real world that is coming to light saying, hey, I need real world applications, not the theory of old school. So what use cases do you see suited for this kind of new way, new architecture, new community, new programability? >> Yeah, I see people doing all sorts of interesting things and I'm sure with what we've introduced with Dremio Arctic and the data is code is going to open up a whole new world of things that we don't even know about today. But generally speaking, we have customers doing very interesting things, very data application things. Like building really high performance data into use cases whether that's a supply chain and manufacturing use case, whether that's a pharma or biotech use case, a banking use case, and really unleashing that data right into an application. We also see a lot of traditional data analytics use cases more in the traditional business intelligence or dashboarding use cases. That stuff is totally achievable, no problems there. But I think the most interesting stuff is companies are really figuring out how to bring that data. When we offer the flexibility that we're talking about, and the agility that we're talking about, you can really start to bring that data back into the apps, into the work streams, into the places where the business gets more value out of it. Not in a dashboard that some person might have access to, or a set of people have access to. 
So even in the Dremio Cloud announcement, the press release, there was a customer, they're in Europe, it's called Garvis AI and they do AI for supply chains. It's an intelligent application and it's showing customers transparently how they're getting to these predictions. And they stood this all up in a very short period of time, because it's a Cloud product. They don't have to deal with provisioning, management, upgrades. I think they had their stuff going in like 30 minutes or something, like super quick, which is amazing. The data was already there, and a lot of organizations, their data's already in these Cloud storages. And if that's the case... >> If they have data, they're a use case. This is agility. This is agility coming to the data engineering field, making data programmable, enabling the data applications, the data ops for everybody, for coding... >> For everybody. And for so many more use cases at these companies. These data engineering teams, these data platform teams, whether they're in marketing or ad tech or Fiserv or Telco, they have a list. There's a list about a roadmap of use cases that they're waiting to get to. And if they're drowning underwater in the current tooling and barely keeping that alive, and oh, by the way, John, you can't go higher 30 new data engineers tomorrow and bring on the team to get capacity. You have to innovate at the architecture level, to unlock more data use cases because you're not going to go triple your team. That's not possible. >> It's going to unlock a tsunami of value. Because everyone's clogged in the system and it's painful. Right? >> Yeah. >> They've got delays, you've got bottlenecks. you've got people complaining it's hard, scar tissue. So now I think this brings ease of use and speed to the table. >> Yeah. >> I think that's what we're all about, is making the data super easy for everyone. This should be fun and easy, not really painful and really hard and risky. In a lot of these old ways of doing things, there's a lot of risk. You start changing your ETL pipeline. You add a column to the table. All of a sudden, you've got potential risk that things are going to break and you don't even know what's going to break. >> Proprietary, not a lot of volume and usage, and on-premises, open, Cloud, Agile. (John chuckles) Come on, which path? The curtain or the box, what are you going to take? It's a no brainer. >> Which way do you want to go? >> Mark, thanks for coming on theCUBE. Really appreciate it for being part of the AWS startup showcase data as code, great conversation. Data as code is going to enable a next wave of innovation and impact the future of data analytics. Thanks for coming on theCUBE. >> Yeah, thanks John and thanks to the AWS team. A great partnership between AWS and Dremio too. Talk to you soon. >> Keep it right there, more action here on theCUBE. As part of the showcase, stay with us. This is theCUBE, your leader in tech coverage. I'm John Furrier, your host, thanks for watching. (downbeat music)
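Two short sketches to ground the mechanics Mark describes in this interview. Both are illustrative only, not Dremio's actual implementation: the catalog names, table names, and connection helper are assumptions, and the exact SQL keywords for branch operations vary by engine and version (Dremio Arctic and Project Nessie expose Git-style branches through SQL extensions, but treat the spelling below as a sketch rather than a verified reference).

```python
# Illustrative Git-style workflow against a branch-aware lakehouse catalog
# (Dremio Arctic / Project Nessie style). The connection helper, the catalog
# name "arctic", and the exact branch keywords are assumptions for this sketch.

def run(cursor, statement):
    print("-- " + statement.strip().splitlines()[0])
    cursor.execute(statement)

def load_with_branch(conn):
    cur = conn.cursor()

    # 1. Branch production metadata. No data files are copied, only pointers.
    run(cur, "CREATE BRANCH etl_2022_06 IN arctic")

    # 2. Do the risky work on the branch; consumers still read 'main'.
    run(cur, "USE BRANCH etl_2022_06 IN arctic")
    run(cur, """
        INSERT INTO arctic.sales.orders
        SELECT * FROM staging.orders_raw
        WHERE order_date = DATE '2022-06-01'
    """)

    # 3. Validate the change before anyone downstream can see it.
    run(cur, """
        SELECT COUNT(*) FROM arctic.sales.orders
        WHERE order_date = DATE '2022-06-01'
    """)
    print(cur.fetchone())

    # 4. Merge back to main once the checks pass (again, a metadata-only operation).
    run(cur, "MERGE BRANCH etl_2022_06 INTO main IN arctic")
```

The second sketch shows the table-format side of the same story: with Apache Iceberg, every insert or delete lands as Parquet data files plus a new metadata snapshot, which is what makes commits, row-level deletes, and time travel possible on object storage. It assumes a Spark session already configured with the Iceberg runtime and a catalog named `lake`; table and schema names are made up, and the time-travel SQL syntax depends on the Spark and Iceberg versions in use.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime is on the classpath and a catalog named
# "lake" is configured; the table, schema, and data here are illustrative only.
spark = SparkSession.builder.appName("iceberg-table-format-sketch").getOrCreate()

# Data lands as Parquet files, but every change also writes Iceberg metadata
# (manifests and snapshots), so the table stays a managed, queryable system.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id BIGINT, user_id BIGINT, ts TIMESTAMP, payload STRING
    ) USING iceberg
""")

spark.sql("INSERT INTO lake.db.events VALUES (1, 42, current_timestamp(), 'click')")
spark.sql("DELETE FROM lake.db.events WHERE user_id = 42")  # row-level delete, recorded as a new snapshot

# The commit log is queryable: which operation ran, and when.
spark.sql("SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots").show()

# Time travel back to the first snapshot recorded for the table.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM lake.db.events.snapshots ORDER BY committed_at"
).first()[0]
spark.sql(f"SELECT * FROM lake.db.events VERSION AS OF {first_snapshot}").show(truncate=False)
```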
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
AWS | ORGANIZATION | 0.99+ |
John | PERSON | 0.99+ |
Europe | LOCATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
Mark Lyons | PERSON | 0.99+ |
30 minutes | QUANTITY | 0.99+ |
Telco | ORGANIZATION | 0.99+ |
Mark | PERSON | 0.99+ |
50 petabytes | QUANTITY | 0.99+ |
five petabytes | QUANTITY | 0.99+ |
two personas | QUANTITY | 0.99+ |
5,000 servers | QUANTITY | 0.99+ |
tomorrow | DATE | 0.99+ |
hundreds of columns | QUANTITY | 0.99+ |
22 databases | QUANTITY | 0.99+ |
Dremio | ORGANIZATION | 0.99+ |
trillions of records | QUANTITY | 0.99+ |
Dremio | PERSON | 0.99+ |
Dremio Arctic | ORGANIZATION | 0.99+ |
Fiserv | ORGANIZATION | 0.99+ |
first time | QUANTITY | 0.98+ |
30 new data engineers | QUANTITY | 0.98+ |
billions of records | QUANTITY | 0.98+ |
thousands of columns | QUANTITY | 0.98+ |
first thing | QUANTITY | 0.98+ |
Thousands of thousands | QUANTITY | 0.98+ |
today | DATE | 0.97+ |
one place | QUANTITY | 0.97+ |
Oracle | ORGANIZATION | 0.97+ |
Apache | ORGANIZATION | 0.96+ |
S3 | TITLE | 0.96+ |
Git | TITLE | 0.96+ |
Cloud | TITLE | 0.95+ |
Hadoop | TITLE | 0.95+ |
first month | QUANTITY | 0.94+ |
Parquet | TITLE | 0.94+ |
Dremio Cloud | TITLE | 0.91+ |
5,000 compute servers | QUANTITY | 0.91+ |
one | QUANTITY | 0.91+ |
JSON | TITLE | 0.89+ |
First | QUANTITY | 0.89+ |
single application | QUANTITY | 0.89+ |
Garvis | ORGANIZATION | 0.88+ |
GitHub | ORGANIZATION | 0.87+ |
Apache | TITLE | 0.82+ |
episode | QUANTITY | 0.79+ |
Agile | TITLE | 0.77+ |
season two | QUANTITY | 0.74+ |
Agile | ORGANIZATION | 0.69+ |
DevOps | TITLE | 0.67+ |
Startup Showcase S2 E2 | EVENT | 0.66+ |
Teradata | ORGANIZATION | 0.65+ |
theCUBE | ORGANIZATION | 0.64+ |
Javier de la Torre, Carto | AWS Startup Showcase S2 E2
(upbeat music) >> Hello, and welcome to theCUBE's presentation of the a AWS startup showcase, data as code is the theme. This is season two episode two of the ongoing series covering the exciting startups from the AWS ecosystem and we talk about data analytics. I'm your old John Furrier with the cube, and we have Javier De La Torre. who's the founder and chief strategy officer of Carto, which is doing some amazing innovation around geographic information systems or GIS. Javier welcome to the cube for this showcase. >> Thank you. Thank you for having me. >> So, you know, one of the things that you guys are bringing to the table is spatial analytic data that now moves into spatial relations, which is, you know, we know about geofencing. You're seeing more data coming from satellites, ground stations, you name it. Things are coming into the market from a data perspective, that's across the board and geo's one of them GIS systems. This is what you guys are doing in the rise of SQL in particular with spatial. This is a huge new benefit to the world. Can you take a minute to explain what Carto's doing and what spatial SQL is? >> Sure. Yeah. So like you said, like data, obviously we know is growing very fast and as you know now, being leveraged by many organizations in many different ways. There's one part of data, one dimension that is location. We like to say that everything happens somewhere. So therefore everything can be analyzed and understood based on the location. So we like to put an example, if all your neighbors get an alarm in their homes, the likelihood that you will get an alarm increases, right? So that's obvious we are all affected by our surroundings. What is spatial analytics, this type of analytics does is try to uncover those spacial relations so that you can model, you can predict where something is going to happen, or, you know, like, or optimize it, you know, like where else you want it to happen, right? So that's at the core of it. Now, this is something that as an industry has been done for many years, like the GIS or geographic information systems have existed for a long time. But now, and this is what Carto really brings to the table. We're looking at really the marketizing it, so that it's in the hands of any analyst, our vision is that you need to go five years, to a geography school to be able to do this type of spatial analysis. And the way that we want to make that happen is what we call with the rise of a spatial SQL. We add these capabilities around spatial analytics based on the language that is very, very popular for a analysts, which is SQL. So what we do is enables you to do this spatial analysis on top of the well known and well used SQL methods. >> It's interesting the cloud native and the cloud scale wave and now data as code has shown that the old school, the old guard, the old way of doing things, you mentioned data warehousing, okay, as one. BI tools in particular have always been limited. And the scope of the limitation was the environment was different. You have to have domain expertise, rich knowledge of the syntax. Usually it's for an application developer, not for like real time and building it into the CICD pipeline, or just from a workflow standpoint, making it available. The so-called democratization, this is where this connects. And so I got to ask you, what are you most excited about in the innovations at Carto? 
Can you share some of the things that people might know about or might not know about that's happening at Carto, that takes advantage of this cloud native wave because companies are now on this bandwagon. >> Yeah, no, it is. And cloud native analytics is probably the most disruptive kind of like trend that we've seen over the few years, in our particular space on the spatial it has tremendous effects on the way that we provide our service. So I'd like to kind of highlight four main reasons why cloud analytics, cloud native is super important to us. So the first one is obviously is a scalability, the working with the sizes of data that we work now in terms of location was just not possible or before. So for someone that is performing now analysis on autonomous car, or you're like that has any sensorized GPS on a device and is collecting hundreds of billions of points. If you want to do analysis on that type of data, cloud native allows you to do that in a scalable way, but it also is very cost effective. That is something that you'll see very quickly when your data grows a lot, which is that this computing storage separation, the idea that is store your data at cloud prices, but then use them with these data warehouses that we work in this private, makes for a very, very cost effective solution. But then, you know, there is other two, obviously one of them being SQL and spatial SQL that like means we like to say that SQL is becoming the lingua franca for analytics. So it's used by many products that you can connect through the usage of SQL, but I think like you coming towards why I think it's even more interesting it's like, in the cloud the concept like we all are serving, we are all living in the same infrastructure enables us that we can distribute a spatial data sets to a customer that they can join it on their database on SQL without having to move the data from one another, like in the case of Redshift or Amazon Redshift car connects and you using something called a spectrum, we can connect live to data that is stored on S3. And I think that is going to disrupt a lot the way that we think about data distributions and how cost effective it is. I think, it has a lot of your like potential on it. And in that sense what Carto is providing on top of it in the format of formats like parquet, which is a very popular with big data format. We adding geo parquet, we are specializing this big data technology for doing the spatial analysis. And that to me it is very exciting because it's putting some of the best tools at the hands of doing the space analytics for something that we're not able to do before. So to me, this is one area that I'm very, very excited. >> Well, I want to back up for a second. So you mentioned parquet and the standards around that format. And also you mentioned Redshift, so let me get this right. So you just saying that you can connect into Redshift. So I'm a customer and I have Redshift I'm using, I got my S3, I'm using Redshift for analysis. You're saying you can plug right into Redshift. >> Yes. And this is a very, very, very important part because what Carto does is leverage Redshift computing infrastructure to essentially kind of like do all the analysis. So what we do is we bring a spatial analysis where the data is, where Redshift is versus in the past, what we will do is take the data where the analysis was and that sense, it's at the core of cloud native. >> Okay. 
This is really where I see the exciting shift where data as code now becomes a reality is that you bring the... It redefines architecture, the script is flipped. The architecture has been redefined. You're making the data move to the environments that needs to move when it has to, if it doesn't have to move you bring compute to it. So you're seeing new kinds of use cases. So I have to ask you on the use cases and examples for Carto AWS customers with spatial analytics, what are some of the examples on how your clients are using cloud native spatial analytics or Carto? >> Yeah. So one, for example, that we've seen a lot, on the AWS ecosystem, obviously because of its suites and its position. We work together with another service in the AWS ecosystem called Amazon Location. So that actually provides you access to maps and SDKs for navigation. So it means that you are like a company that is delivering food or any other goods in the city. We have like hundreds or thousands of drivers around the city moving, doing all these deliveries. And each of these drivers they have an app and they're collecting actively their location, their position, right? So you get all the data and then it gets stored on something like a Redshift data cluster on S3 as well. There's different architectures in there, but now you essentially have like a full log of the activity that is happening on the ground from your business. So what Carto does on top of that data is you connect your data into Carto. And now you can do analysis, for example, for finding out where you user may be placed, another distribution center, you know, for optimizing your delivering routes, or like if you're in the restaurant business where you might want to have a new dark kitchen, right? So all this type of analysis based on, since I know where you're doing your operations, I can post analyze the data and then provide you a different way that you can think about solving your operation. So that's an example of a great use case that we're seeing right now. >> Talk to me about about the traditional BI tools out there, because you mentioned earlier, they lack the specific capabilities. You guys bring that to the table. What about the scalability limitations? Can you talk about where that is? Is there limitations there, obviously, if they don't have the capabilities, you can't scale that's one, but you know, as you start plugging into Redshift, scale and performance matters, what's the issue there? Can you unpack that a little bit real quick? >> Yeah. It goes back to the particulars of the spacial data, location data, like in the use case, like I was describing you very quickly are going to end up with really a lot of your like terabytes, if not petabytes of data very quickly, if you're start aggregating all this data, because it gets created by sensors. So volumes in our world kind of tends to grow a lot now. So when you work with BI tools, there's two things that you have to take in consideration. BI tools are great for seeing things like for example, if all you want to see is where your customers are, a BI tool is great. Seeing, creating a map and seeing your customers. That's totally in the world of BI. But if you want to understand why your customers are there, or where else could they be, you're going to need to perform what we call a spatial analysis. You're going to have to create a spatial model. 
You're going to have to, and for that BI tools will not give you that that's one side, the other it talks about the volumes that I was describing. Most of these BI tools can handle certain aggregations. Like, for example, if you are reading, if you're connecting your, let's say 10 billion data set to a BI tool, the BI tool will do some aggregations because you cannot display 10,000 rows on a BI tool and that's okay, you get aggregations and that works. But when it comes to a map, you cannot aggregate the data on the map. You actually want to see all the data on the map, and that's what Carto provides you. It allows you to make maps that sees all the data, not just aggregated by county or aggregated by other kind of like area, you see all your data on the map. >> You know, what's interesting is that location based service has been around for a long time. You know, when mobile started even hitting the scene, you saw it get better mashups, Google Maps, all this Google API mashups, things like that. You know, developers are used to it, but they could never get to the promised land on the big data side, because they just didn't have the compute. But now you add in geofencing, geo information, you now have access to this new edge like data, right? So I have to ask you on the mobile side, are you guys working with any 5G or edge providers? Because I can almost imagine that the spatial equation gets more complicated and more data full when you start blowing out edge data, like with 5G, you got more, more things happening at the edge. It's only going to fill in more data points. Can you share that's how that use case is going with mobile, mobile carriers or 5G? >> Yeah, that's totally, yeah. It's totally the case. Well, first, even before, you know, like we are there, we actually helping a lot of telcos on actually planning the 5G deployment. Where do you place your antennas is a very, very important topic when you're like talking about 5G. Because you know, like 5G networks require a lot of density. So it's a lot about like, okay, where do I start deploying my infrastructure to ensure the customers like meet, like have the best service and the places where I want to kind of like go first So like... >> You mean like the RF maps, like understanding how RF propagates. >> Well, that's one signal, but the other is like, imagine that your telco is more interested on, you know, let's say on a certain kind of like consumer profile, like young people that are using the one type of service. Well, we know where these demographics kind of lives. So you might want to start kind of like deploying your 5G in those areas, right. Versus if you go to more commercial and more kind of like residential areas, there might be other demographics. So that's one part around market analysis. Then the second part is once these 5G networks are in place, you're right. I mean, one of the premises that kind of like these news technologies give us is because the network is much smarter. You can have all these edge cases, there's much more location data that can be collected. So what we see now is a rise on the amount of what we call telemetry. That for example, the IOT space can make around location. And that's now enabled because of 5G. So I think 5G is going to be one of those trends that are going to make like more and more data coming into, I mean, more location, data available for analysis. 
>> So how does that, I mean, this is a great conversation because everyone can realize they're at a stadium and they see multiple bars but they can't get bandwidth. So they got a back haul problem or not enough signal. Everyone knows when they're driving their car, they know they can relate to the consumer side of it. So I get how the spatial data grows. What's the impact to Carto and specifically the cloud, because if you have more data coming in, you need the actionable insight. So I can see the use case, oh, put the antenna here. That's an actionable business decision, more content, more revenue, more happy customers, but where else is the impact to you guys and the spatial piece of it? >> Yeah. Well, I mean like there's many, many factors, right? So one of them, for example, on the telco, one of the things where we realize impact is that it gives the visibility to the operator, for example, around the quality of service. Like, okay, are my customers getting the quality of services where I want? Or like you said, like if there sitting outside a concert the quality of service in one particular area is dropping very fast. So the idea of like being able to now in real time, kind of like detect location issues, like I'm having an issue in this place. That means that then now I can act, I can drive up bandwidth, put more capacity et cetera right. So I think the biggest impact that we are seeing we are going to see on the upcoming years is that like more and more use cases going towards real time. So where, like before it was like, well, now that it has happened, I'm going to analyze it. I'm going to look at, you know, like how I could do better next time towards a more of like an industry where Carto ourselves, we are embedded in more real time type of, you know, like analytics where it's okay, if this happens, then do that, right. So it's going to be more personalized at the level that like in the code environment, it has to be art of a full kind of like pipeline kind of like type of analysis. That's already programmatically prepared to act on real time. >> That's great and it's a good segue. My next question, as more and more companies adopt cloud native analytics, what trends are you seeing out of the key to watch? Obviously you're seeing more developers coming on site, on the scene, open sources growing, what's the big cloud native analytics trends for Carto and geographic information. >> Yeah. So I think you know like the, we were talking before the cloud native now is unstoppable, but one of the things that we are seeing that is still needs to be developed and we are seeing progress is around a standardization, for example, around like data sets that are provided by different providers. What I mean with that is like, you as an organization, you're going to be responsible for your data like that you create on your cloud, right. On S3, or, you know and then you going to have a competing engine, like Redshift and you're going to have all that set up, but then you also going to have to think about like, okay, how do I ingest data from third party providers that are important for my analysis? So for example, Carto provides a lot of demographics, human mobility. we aggregate and clean up and prepare lot of spacial data so that we can then enrich your business. So for us, how we deliver that into your cloud native solution is a very important factor. And we haven't seen yet enough standardization around that. 
And that's one of the things, what we are pushing, you know, with the concept of geo Parquet of standardizing that body. That's one, then there is another, this is more what I like to say that you know, we are helping companies figure out their own geographies. What we mean by that is like most companies, when they start thinking about like how they interact, on the space, on the location, some of them will work like by zip codes and other by cities, they organize their operations based on a geography in a way, or technically what we call a geographic support system. Well, nowadays, like the most advance companies are defining their geographies in a continuous spectrum in what we call global grid system or spatial indexes that allows them to understand the business, not just as a set of regions, but as a continuous space. And that is now possible because of the technologies that we are introducing around spatial indexes at the cloud native infrastructure. And it provides a great a way to match data with resources and operate at scale. To me those two trends are going to be like very, very important because of the capabilities that cloud native brings to our spatial industry. >> So it changes the operation. So it's data as ops, data as code, is data ops, like infrastructures code means cloud DevOps. So I got to ask you because that's cool. Spatial index is a whole another way to think of it, rather than you go hyper local, super local, you get local zones for AWS and regions. Things are getting down to the granular levels I see that. So I have to ask you, what does data as code mean to you and what does it mean to Carto? Because you're kind of teasing at this new way because it's redefining the operation, the data operations, data engineering. So data as code is real. What does that mean to you? >> No, I think we already seeing it happening to me and to Carto what I will describe data as code is when an organization has moved from doing an analysis after the fact, like where they're like post kind of like analysis in a way to where they're actually kind of like putting analytics on their operational cycle. So then they need to really code it. They need to make these analysis, put them and insert them into the architecture bus, if you want to say of the organization. So if I get a customer, happens to be in this location, I'm going to trigger that and then this is going to do that. Or if this happens, I'm need to open up. And this is where if an organization is going to react in more real time, and we know that organizations need to drive in that direction, the only way that they can make that happen is if they operationalize analytics on their daily operations. And that can only happen with data as code. >> Yeah. And that's interesting. Look at ML ops, AI ops, people talk about that. This is data, so developers meets operations, that's the cloud, data meets code that's operations, that's data business. >> You got it. And add to that, the spacial with Carto and we go it. >> Yeah, because every piece of data now is important. And the spatial's key real quick before we close out, what is the index thing? Explain the benefit real quick of a spatial index. >> Yes. So the spatial index is well everybody can understand how we organize societies politically, right? Our countries, you have like states and then you have like counties and you have all these different kind, what we call administrative boundaries, right? That's a way that we organize information too, right? 
A spatial index is when you divide the world, not in administrative boundaries, but you actually make a grid. Imagine that you just essentially make a grid of the world. right? And you make that grid so that in every cell you can then split it into, let's say for example, four more cells. So you now have like an organization. You split the world in a grid that you can have multiple resolutions think like Google maps when you see the entire world, but you can zoom in and you end up seeing, you know, like one particular place, so that's one thing. So what a spatial indexes allows you is to technically put, you know like your location, not based coordinate, but actually on one grid place on an index. And we use that then later to correlate, let's say your data with someone else data, as we can use what we call this spatial indexes to do joints very, very fast and we can do a lot of operations with it. So it is a new way to do spatial computing based on this type of indexes, but for more than anything for an organization, what spatial index allows is that you don't need to work on zip codes or in boundaries on artificial boundaries. I mean, your customer doesn't change because he goes from this place to the road, to the other side of the road, this is the same place. It's an arbitrary in location. It's a spatial index break out all of that. You're like you break with your zip codes, you break. And you essentially have a continuous geography, that actually is a much closer look up to the reality. >> It's like the forest and the trees and the bark of the tree. (Javier laughing) You can see everything. >> That's it, you can get a look at everything. >> Javi, great to have you on. In real quick closing give a quick plug for the company, summarize what you do, what you're looking into, how many people you got, when you're hiring, what's the key goals for the company? >> Yeah, sure. So Carto is a company, now we are around 200 people. Our vision is that spatial analytics is something that every organization should do. So we really try to enable organizations with the best data and analysis around spatial. And we do all that cloud native on top of your data warehouse. So what we are really in enabling these organizations is to take that cloud native approach that they're already embracing it also to spatial analysis. >> Javi, founder, chief strategy officer for Carto. Great to have you on data as code, all data's real, all data has impact, operational impact with data is the new big trend. Thanks for coming on and sharing the company story and all your key innovations. Thank you. >> Thanks to you. >> Okay. This is the startup showcase. Data as code, season two episode two of the ongoing series. Every episode will explore new topics and new exciting companies pioneering this next cloud native wave of innovation. I'm John Furrier, your host of theCUBE. Thanks for watching. (upbeat music)
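A few short sketches to ground the ideas Javier walks through in this interview. All three are illustrative: the table names, column names, file names, coordinates, and connection helpers are assumptions, and spatial function spellings differ slightly across PostGIS, BigQuery, Snowflake, and Redshift.

First, spatial SQL. The point of "the rise of spatial SQL" is that predicates like ST_Contains and ST_DWithin turn location questions into ordinary SQL joins and filters that any analyst can write. The sketch assumes a DB-API style connection, for example a psycopg2 connection to a PostGIS-enabled database:

```python
# Illustrative spatial SQL through a plain DB-API connection. Table and column
# names are made up; function spellings vary slightly by warehouse.

def customers_per_neighborhood(conn):
    # Point-in-polygon join: score each neighborhood by the customers inside it.
    sql = """
        SELECT n.name, COUNT(c.customer_id) AS customers_inside
        FROM   neighborhoods AS n
        JOIN   customers      AS c ON ST_Contains(n.geom, c.geom)
        GROUP  BY n.name
        ORDER  BY customers_inside DESC
    """
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

def customers_near_store(conn, store_id, meters=500):
    # Distance predicate: who is within N meters of a given store?
    sql = """
        SELECT c.customer_id
        FROM   customers AS c, stores AS s
        WHERE  s.store_id = %s
          AND  ST_DWithin(s.geog, c.geog, %s)
    """
    cur = conn.cursor()
    cur.execute(sql, (store_id, meters))
    return [row[0] for row in cur.fetchall()]
```

Next, the GeoParquet idea: keeping spatial data in an open, columnar, cloud-friendly file format so it can be distributed without copying it into a proprietary system. GeoPandas can already write and read it; a reasonably recent geopandas with pyarrow installed is assumed:

```python
import geopandas as gpd
from shapely.geometry import Point

# Build a tiny GeoDataFrame and persist it as (Geo)Parquet; names are illustrative.
stores = gpd.GeoDataFrame(
    {"store_id": [1, 2], "city": ["Madrid", "Brooklyn"]},
    geometry=[Point(-3.7038, 40.4168), Point(-73.9442, 40.6782)],
    crs="EPSG:4326",
)
stores.to_parquet("stores.parquet")  # geometry encoded per the GeoParquet spec

# Any engine that understands the spec can read the same file back, unchanged.
roundtrip = gpd.read_parquet("stores.parquet")
print(roundtrip.crs, len(roundtrip))
```

Finally, the global grid, or spatial index, idea Javier describes. Uber's H3 is one such grid system: every coordinate maps to a hexagonal cell id, the grid is hierarchical, and joining two datasets becomes a match on cell ids rather than geometry math. The function names below are the h3-py v3 spellings; v4 renamed several of them (for example latlng_to_cell):

```python
import h3  # Uber's hexagonal global grid library (h3-py, v3 function names)

# Index a raw coordinate into a hex cell; no zip codes or admin boundaries involved.
lat, lng = 40.4168, -3.7038             # an arbitrary point, for illustration
cell = h3.geo_to_h3(lat, lng, 9)        # resolution 9 gives roughly block-sized hexes
print("cell:", cell)

# The grid is hierarchical, so zooming out is just asking for the parent cell...
print("parent:", h3.h3_to_parent(cell, 6))

# ...and "nearby" is the cell plus its ring of neighbors, which turns spatial joins
# into simple equality matches on cell ids instead of geometry computations.
print("neighbors:", h3.k_ring(cell, 1))
```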
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Javier De La Torre | PERSON | 0.99+ |
Javi | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
Carto | ORGANIZATION | 0.99+ |
10,000 rows | QUANTITY | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Javier | PERSON | 0.99+ |
five years | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
one part | QUANTITY | 0.99+ |
Redshift | TITLE | 0.99+ |
SQL | TITLE | 0.99+ |
second part | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
each | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
two things | QUANTITY | 0.99+ |
two | QUANTITY | 0.98+ |
one side | QUANTITY | 0.98+ |
first | QUANTITY | 0.98+ |
Google maps | TITLE | 0.97+ |
5G | ORGANIZATION | 0.97+ |
around 200 people | QUANTITY | 0.97+ |
one dimension | QUANTITY | 0.97+ |
Google Maps | TITLE | 0.97+ |
one signal | QUANTITY | 0.97+ |
Carto | PERSON | 0.96+ |
two trends | QUANTITY | 0.96+ |
telco | ORGANIZATION | 0.95+ |
one area | QUANTITY | 0.95+ |
Javier de la Torre, Carto | PERSON | 0.94+ |
four more cells | QUANTITY | 0.92+ |
10 billion data | QUANTITY | 0.92+ |
first one | QUANTITY | 0.91+ |
hundreds of | QUANTITY | 0.9+ |
one thing | QUANTITY | 0.9+ |
S3 | TITLE | 0.89+ |
parquet | TITLE | 0.84+ |
Carto | TITLE | 0.83+ |
season two | QUANTITY | 0.82+ |
petabytes | QUANTITY | 0.77+ |
billions of points | QUANTITY | 0.76+ |
Redshift | ORGANIZATION | 0.76+ |
one grid | QUANTITY | 0.75+ |
episode | QUANTITY | 0.75+ |
Analyst Predictions 2022: The Future of Data Management
>> [Music] In the 2010s, organizations became keenly aware that data would become the key ingredient in driving competitive advantage, differentiation and growth. But to this day, putting data to work remains a difficult challenge for many, if not most, organizations. Now, as the cloud matures, it has become a game changer for data practitioners by making cheap storage and massive processing power readily accessible. We've also seen better tooling in the form of data workflows, streaming, machine intelligence, AI, developer tools, security, observability, automation, new databases and the like. These innovations accelerate data proficiency, but at the same time they add complexity for practitioners. Data lakes, data hubs, data warehouses, data marts, data fabrics, data meshes, data catalogs, data oceans are forming, evolving and exploding onto the scene. So in an effort to bring perspective to the sea of optionality, we've brought together the brightest minds in the data analyst community to discuss how data management is morphing and what practitioners should expect in 2022 and beyond. Hello, everyone, my name is Dave Vellante with theCUBE, and I'd like to welcome you to a special CUBE presentation, Analyst Predictions 2022: The Future of Data Management. We've gathered six of the best analysts in data and data management, who are going to present and discuss their top predictions and trends for 2022 and the first half of this decade. Let me introduce our six power panelists: Sanjeev Mohan is a former Gartner analyst and principal at SanjMo; Tony Baer is principal at dbInsight; Carl Olofson is a well-known research vice president with IDC; Dave Menninger is senior vice president and research director at Ventana Research; Brad Shimmin is chief analyst for AI platforms, analytics and data management at Omdia; and Doug Henschen is vice president and principal analyst at Constellation Research. Gentlemen, welcome to the program, and thanks for coming on theCUBE today.
>> Great to be here.
>> Thank you.
>> All right, here's the format we're going to use. I, as moderator, am going to call on each analyst separately, who will then deliver their prediction or mega-trend, and then, in the interest of time management and pace, two analysts will have the opportunity to comment. If we have more time we'll elongate it, but let's get started right away. Sanjeev Mohan, please kick it off. You want to talk about governance — go ahead, sir.
>> Thank you, Dave. I believe that data governance, which we've been talking about for many years, is now not only going to be mainstream, it's going to be table stakes. And for all the things that you mentioned — data oceans, data lakes, lakehouses, data fabrics, meshes — the common glue is metadata. If we don't understand what data we have and we are not governing it, there is no way we can manage it. So we saw Informatica go public last year after a hiatus of six years. I'm predicting that this year we see some more companies go public; my bet is on Collibra most likely, and maybe Alation, we'll see, go public this year. I'm also predicting that the scope of data governance is going to expand beyond just data. It's not just data and reports; we are going to see more transformations, like Spark jobs, Python, even Airflow. We're going to see more streaming data — so Kafka schema registry, for example. We will see AI models become part of this whole governance suite. So the governance suite is going to be very comprehensive, very detailed: lineage, impact analysis, and then even expanding into data quality. We've already seen that happen with some of the tools, where they are buying these smaller companies and bringing in data quality monitoring and integrating it with metadata management, data catalogs, also data access governance. So what we are going to see is that data governance platforms become the key entry point into these modern architectures. I'm predicting that the number of users of a data catalog is going to exceed that of a BI tool. That will take time, but we've already seen that trajectory. Right now, if you look at BI tools, I would say there are 100 users of a BI tool to one of a data catalog, and I see that evening out over a period of time. At some point data catalogs will really become the main way for us to access data. The data catalog will help us visualize data, but if we want to do more in-depth analysis, it'll be the jumping-off point into the BI tool, the data science tool, and that is the journey I see for the data governance products.
>> Excellent, thank you. Some comments? Maybe Doug — a lot of things to weigh in on there. Maybe you could comment.
>> Yeah, Sanjeev, I think you're spot on about a lot of the trends. The one disagreement: I think it's really still far from mainstream. As you say, we've been talking about this for years. It's like God, motherhood, apple pie — everyone agrees it's important, but too few organizations are really practicing good governance, because it's hard and because the incentives have been lacking. I think one thing that deserves mention in this context is ESG mandates and guidelines — these are environmental, social and governance regs and guidelines. We've seen the environmental regs and guidelines imposed in industries, particularly the carbon-intensive industries. We've seen the social mandates, particularly diversity, imposed on suppliers by companies that are leading on this topic. We've seen governance guidelines now being imposed by banks and investors. So these ESGs are presenting new carrots and sticks, and it's going to demand more solid data, more detailed and solid reporting, tighter governance. But we're still far from mainstream adoption. We have a lot of best-of-breed niche players in the space. I think the signs that it's going to be more mainstream are starting with things like Azure Purview and Google Dataplex — the big cloud platform players seem to be upping the ante and starting to address governance.
>> Excellent, thank you, Doug. Brad, I wonder if you could chime in as well.
>> Yeah, I would love to be a believer in data catalogs, but to Doug's point, I think it's going to take some more pressure for that to happen. I recall metadata being something every enterprise thought they were going to get under control when we were working on service-oriented architecture back in the '90s, and that didn't happen quite the way we anticipated. And to Sanjeev's point, it's because it is really complex and really difficult to do. My hope is that we won't — how do we put this — fade out into this nebulous nebula of domain catalogs that are specific to individual use cases, like Purview for getting data quality right, or for data governance and cybersecurity, and instead have some tooling that can actually be adaptive, gathering metadata to create something I know is important to you, Sanjeev, and that is this idea of observability. If you can get enough metadata without moving your data around, and understand the entirety of a system that's running on this data, you can do a lot to help with the governance that Doug is talking about.
>> I just want to add that data governance, like many other initiatives, did not succeed — even AI went into an AI winter, but that's a different topic — and a lot of these things did not succeed because, to your point, the incentives were not there. I remember when Sarbanes-Oxley came onto the scene: if a bank did not comply, they were very happy to pay a million-dollar fine — that was pocket change for them — instead of doing the right thing. But I think the stakes are much higher now. With GDPR the floodgates opened; California has CCPA, but even CCPA is being outdated by CPRA, which is much more GDPR-like. So we are very rapidly entering a space where pretty much every major country in the world is coming up with its own compliance and regulatory requirements. Data residency is becoming really important, and I think we are going to reach a stage where it won't be optional anymore, whether we like it or not. And I think the reason data catalogs were not successful in the past is that we did not have the right focus on adoption; we were focused on features, and these features were disconnected, very hard for the business to adopt. These were built by IT people, for IT departments, to look at technical metadata, not business metadata. Today the tables have turned: CDOs are driving this initiative, regulatory compliance is bearing down hard, so I think the time might be right.
>> Yeah, so guys, we have to move on here, but there's some real meat on the bone. Sanjeev, I like the fact that you called out Collibra and Alation, so we can look back a year from now and say, okay, he made the call, he stuck with it. And then the ratio of BI tools to data catalogs — that's another sort of measurement we can take, even though there's some skepticism there; that's something we can watch. And I wonder if someday we'll have more metadata than data. But I want to move to Tony Baer. You want to talk about data mesh — and coming off of governance, I mean, wow, the whole concept of data mesh is decentralized data, and then governance becomes a nightmare there. But take it away, Tony.
>> Well, put it this way: data mesh — the idea, at least as proposed by Thoughtworks — was basically unleashed a couple of years ago, and the press has been almost uniformly uncritical. A good reason for that is all the problems that Sanjeev and Doug and Brad were just speaking about: we have all this data out there and we don't know what to do about it. Now, that's not a new problem. It was a problem when we had enterprise data warehouses, it was a problem when we had our Hadoop data clusters, and it's even more of a problem now that the data's out in the cloud, where the data is not only in S3, it's all over the place — and it also includes streaming, which I know we'll be talking about later. So data mesh was a response to that, the idea being that the folks who really know best about governance are the domain experts. Data mesh was basically an architectural pattern and a process. My prediction for this year is that data mesh is going to hit cold, hard reality, because if you do a Google search, the published work, the articles, have been largely uncritical so far — basically treating it as a very revolutionary new idea. I don't think it's that revolutionary, because we've talked about ideas like this before. Brad, you and I met years ago when we were talking about SOA, and decentralizing all of this was at the application level; now we're talking about it at the data level, and now we have microservices. So there's this thought of: if we manage apps cloud-natively through microservices, why don't we think of data in the same way? My sense this year — and this has been a very active search term if you look at Google search trends — is that enterprises are going to look at this seriously, and as they look at it seriously, it's going to attract its first real hard scrutiny, its first backlash. That's not necessarily a bad thing; it means it's being taken seriously. The reason I think you'll start to see the cold, hard light of day shine on data mesh is that it's still a work in progress. The idea is basically a couple of years old, and there are still some pretty major gaps. The biggest gap is in the area of federated governance. Now, federated governance itself is not a new issue. With federated governance, we're trying to figure out how to strike the balance between consistent enterprise policy and governance on one hand and, on the other, letting the groups that understand the data govern it — how do we balance the two? There's a huge gap there in practice and knowledge. Also, to a lesser extent, there's a technology gap in the self-service technologies that will help teams govern data through the full life cycle: from selecting the data, to building the pipelines, to determining your access control, to looking at quality, to whether data is fresh or whether it's trending off course. So my prediction is that it will really receive its first harsh scrutiny this year. You are going to see some enterprises declare premature victory when they've built some federated query implementations. You're going to see vendors start to data-mesh-wash their products: anybody in the data management space — whether it's a pipelining tool, whether it's ELT, whether it's a catalog or a federated query tool — they're all going to be promoting the fact of how they support this. Hopefully nobody is going to call themselves a data mesh tool, because data mesh is not a technology. We're going to see one other thing come out of this, and this harks back to the metadata that Sanjeev was talking about and the catalogs he was talking about, which is that there's going to be a renewed focus on metadata, and I think that's going to spur interest in data fabrics. Now, data fabrics are pretty vaguely defined, but if we just take the most elemental definition, which is a common metadata backplane, I think that if anybody is going to get serious about data mesh, they need to look at a data fabric, because we all, at the end of the day, need to read from the same sheet of music.
>> So, thank you, Tony. Dave Menninger, one of the things people like about data mesh is that it pretty crisply articulates some of the flaws in today's organizational approaches to data. What are your thoughts on this?
>> Well, I think we have to start by defining data mesh, right? The term is already getting corrupted. Tony said it's going to see the cold, hard light of day, and there's a problem right now: there are a number of overlapping terms that are similar but not identical. We've got data virtualization, data fabric, data federation. I think it's not really clear what each vendor means by these terms. I see data mesh and data fabric becoming quite popular. I've interpreted data mesh as referring primarily to the governance aspects, as originally intended and specified, but that's not the way I see vendors using it; I see vendors using it much more to mean data fabric and data virtualization. So I'm going to comment on the group of those things. I think the group of those things is going to happen; they're going to become more robust. Our research suggests that a quarter of organizations are already using virtualized access to their data lakes, and another half — so a total of three quarters — will eventually be accessing their data lakes using some sort of virtualized access. Again, whether you define it as mesh or fabric or virtualization isn't really the point here, but there's this notion that there are different elements of data, metadata and governance within an organization that all need to be managed collectively. The interesting thing is, when you look at the satisfaction rates of organizations using virtualization versus those that are not, it's almost double: 79% of organizations that were using virtualized access expressed satisfaction with their access to the data lake; only 39% expressed satisfaction if they weren't using virtualized access.
>> So thank you, Dave. Sanjeev, we've just got about a couple of minutes on this topic, but I know you're speaking — or maybe you've spoken already — on a panel with Zhamak Dehghani, who sort of invented the concept. Governance obviously is a big sticking point, but what are your thoughts on this? You are on mute.
>> So my message to Zhamak and to the community is, as opposed to what Dave said, let's not define it. We spent the whole year defining it. There are four principles: domain, product, data infrastructure and governance. Let's take it to the next level. I get a lot of questions on what is the difference between data fabric and data mesh, and I'm like, I can't compare the two, because data mesh is a business concept and data fabric is a data integration pattern. How do you compare the two? You have to bring data mesh a level down. So, to Tony's point, I'm on a warpath in 2022 to take it down to: what does a data product look like? How do we handle shared data across domains and govern it? And I think what we are going to see more of in 2022 is operationalization of data mesh.
>> I think we could have a whole hour on this topic, couldn't we? Maybe we should do that. But let's move to Carl. Carl, you're a database guy, you've been around that block for a while now. You want to talk about graph databases? Bring it on.
>> Oh yeah, okay, thanks. So I regard graph databases as basically the next truly revolutionary database management technology. I'm looking, for the graph database market — which of course we haven't defined yet, so obviously I have a little wiggle room in what I'm about to say — for this market to grow by about 600 percent over the next 10 years. Now, 10 years is a long time, but over the next five years we expect to see gradual growth as people start to learn how to use it. The problem is not that it's not useful; it's that people don't know how to use it. So let me explain, before I go any further, what a graph database is, because some of the folks on the call may not know. A graph database organizes data according to a mathematical structure called a graph. A graph has elements called nodes and edges: a data element drops into a node, and the nodes are connected by edges. The edges connect one node to another node, and combinations of edges create structures that you can analyze to determine how things are related. In some cases the nodes and edges can have properties attached to them, which add additional informative material that makes it richer; that's called a property graph. There are two principal use cases for graph databases. There are semantic graphs, which are used to break down human language text into semantic structures; then you can search it, organize it and answer complicated questions — a lot of AI is aimed at semantic graphs. The other kind is the property graph that I just mentioned, which has a dazzling number of use cases. I want to just point out, as I talk about this, that people are probably wondering, well, we have relational databases — isn't that good enough? A relational database supports what I call definitional relationships. That means you define the relationships in a fixed structure, and the data drops into that structure: there's a foreign key value that relates one table to another, and that value is fixed. You don't change it; if you change it, the database becomes unstable and it's not clear what you're looking at. In a graph database, the system is designed to handle change, so that it can reflect the true state of the things it's being used to track. Let me just give you some examples of use cases. They include entity resolution, data lineage, social media analysis, customer 360, fraud prevention. There's cybersecurity; supply chain is a big one, actually. There's explainable AI, and this is going to become important too, because a lot of people are adopting AI but want a system, after the fact, to say how the AI system came to that conclusion, how it made that recommendation — right now we don't have really good ways of tracking that. Machine learning in general; social networks, I already mentioned that. And then we've got data governance, data compliance, risk management, recommendation, personalization, anti-money laundering — that's another big one — identity and access management, and network and IT operations, which is already becoming a key one, where you have actually mapped out your operation, your data center, and you can track what's going on as things happen there. Root cause analysis. Fraud detection is a huge one: a number of major credit card companies use graph databases for fraud detection. Risk analysis, tracking and tracing, churn analysis, next best action, what-if analysis, impact analysis, entity resolution — and I would add a few other things to this list, like metadata management. So, Sanjeev, here you go, this is your engine — because I was in metadata management for quite a while in my past life, and one of the things I found was that none of the data management technologies available to us could efficiently handle metadata, because of the kinds of structures that result from it. But graphs can.
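[Editor's note: as a minimal illustration of the node-and-edge model Carl describes — data elements in nodes, edges connecting them, properties attached to both — here is a hedged sketch in Python using the open-source networkx library. The accounts, devices and relationship types are hypothetical, invented purely for illustration; this is not the data model or API of any graph database product mentioned on the panel.]

```python
# Minimal property-graph sketch: nodes carry properties, edges connect nodes
# and carry properties of their own. Illustrative only.
import networkx as nx

g = nx.MultiDiGraph()  # directed graph that allows multiple typed edges

# Nodes ("data elements") with attached properties -- a property graph.
g.add_node("acct_1", kind="account", owner="alice")
g.add_node("acct_2", kind="account", owner="bob")
g.add_node("device_9", kind="device", os="android")

# Edges ("relationships") with their own properties.
g.add_edge("acct_1", "device_9", rel="logged_in_from", when="2022-01-03")
g.add_edge("acct_2", "device_9", rel="logged_in_from", when="2022-01-04")

# A simple fraud-style question: which accounts are related through a shared device?
shared = [n for n in g.predecessors("device_9") if g.nodes[n]["kind"] == "account"]
print(shared)  # ['acct_1', 'acct_2'] -- related via the device node, no fixed schema needed
```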
Graphs can do things like say: this term, in this context, means this, but in that context it means that — things like that. And, in fact, logistics management and supply chain as well, because it handles recursive relationships. By recursive relationships I mean objects that own other objects of the same type. You can do things like bill of materials — parts explosion. You can do an HR analysis: who reports to whom, how many levels up the chain, that kind of thing. You can do that with relational databases, but it takes a lot of programming. In fact, you can do almost any of these things with relational databases, but the problem is you have to program it; it's not supported in the database. And whenever you have to program something, that means you can't trace it, you can't define it, you can't publish it in terms of its functionality, and it's really, really hard to maintain over time.
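[Editor's note: to make the recursive-relationship point concrete, here is a small hedged sketch of the "who reports to whom, how many levels up the chain" question as a plain graph traversal, again in Python with networkx. The names and the single-manager assumption are hypothetical; in a relational system the equivalent would typically be a hand-written recursive query, which is the programming burden Carl is describing.]

```python
# Recursive relationships: employees who report to other employees.
# Walking "how many levels up the chain" is an ordinary traversal on a graph.
import networkx as nx

org = nx.DiGraph()
org.add_edges_from([
    ("dana", "chris"),   # dana reports to chris
    ("chris", "blake"),  # chris reports to blake
    ("blake", "avery"),  # blake reports to avery (the top)
])

def chain_of_command(g, person):
    """Return everyone above `person`, nearest manager first."""
    chain = []
    current = person
    while True:
        managers = list(g.successors(current))
        if not managers:                # reached the top of the chain
            return chain
        current = managers[0]           # assumes a single manager per person
        chain.append(current)

print(chain_of_command(org, "dana"))        # ['chris', 'blake', 'avery']
print(len(chain_of_command(org, "dana")))   # 3 levels up the chain
```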
>> So, Carl, thank you. I wonder if we could bring Brad in. I mean, Brad, I'm sitting there wondering, okay, is this incremental to the market, or is it disruptive, a replacement? What are your thoughts on this space?
>> It's already disrupted the market. I mean, like Carl said, go to any bank and ask them, are you using graph databases to get fraud detection under control, and they'll say, absolutely, that's the only way to solve this problem — and it is, frankly. And it's the only way to solve a lot of the problems that Carl mentioned, and that, I think, is its Achilles' heel in some ways. It's like finding the best way to cross the seven bridges of Königsberg: it's always going to kind of be tied to those use cases, because it's really special and it's really unique. And because it's special and unique, it still, unfortunately, kind of stands apart from the rest of the community that's building, let's say, AI outcomes, as the great example here. Graph databases and AI, as Carl mentioned, are like chocolate and peanut butter, but technologically they don't know how to talk to one another; they're completely different. And you can't just stand up SQL and query them. You've got to learn — what is it, Carl — a special query language to actually get to the data in there. And if you're going to scale that data, that graph database — especially a property graph — if you're going to do something really complex, like try to understand all of the metadata in your organization, you might just end up with a graph database winter, like we had the AI winter, simply because you run out of performance to make the thing happen. So I think it's already disrupted, but we need to treat it like a first-class citizen in the data analytics and AI community. We need to bring it into the fold, we need to equip it with the tools it needs to do the magic it does, and to do it not just for specialized use cases but for everything — because I'm with Carl, I think it's absolutely revolutionary.
>> So I had also identified the principal Achilles' heel of the technology, which is scaling. When these things get large and complex enough that they spill over what a single server can handle, you start to have difficulties, because the relationships span things that have to be resolved over a network, and then you get network latency, and that slows the system down. So that's still a problem to be solved.
>> Sanjeev, any quick thoughts on this? I mean, I think metadata on the word cloud is going to be the largest font, but what are your thoughts here?
>> I want to step away, so people don't associate me with only metadata, and talk about something slightly different. DB-engines.com has done an amazing job — I think almost everyone knows that they chronicle all the major databases in use today. In January of 2022 there are 381 databases on its ranked list. The largest category is RDBMS; the second largest category is actually divided into two, property graphs and RDF graphs, and these two together make up the second largest number of databases. So, talking about Achilles' heels here, this is the problem: there are so many graph databases to choose from, and they come in different shapes and forms. To Brad's point, there are so many query languages. In RDBMS, it's SQL, end of story; here we've got Cypher, we've got Gremlin, we've got GQL, and then your proprietary languages. So I think there's a lot of disparity in this space.
>> All excellent points, Sanjeev, I must say, and that is a problem. The languages need to be sorted and standardized, and people need to have a road map as to what they can do with it, because, as you say, you can do so many things, and so many of those things are unrelated, that you sort of say, well, what do we use this for? I'm reminded of a saying I learned a bunch of years ago, when somebody said that the digital computer is the only tool man has ever devised that has no particular purpose.
>> All right, guys, we've got to move on to Dave Menninger. We've heard about streaming; your prediction is in that realm, so please take it away.
>> Sure. So I like to say that historical databases are going to become a thing of the past, but I don't mean that they're going to go away; that's not my point. I mean we need historical databases, but streaming data is going to become the default way in which we operate with data. So in the next, say, three to five years, I would expect that data platforms — and we're using the term data platforms to represent the evolution of databases and data lakes — will incorporate these streaming capabilities. We're going to process data as it streams into an organization, and then it's going to roll off into historical databases. So historical databases don't go away, but they become a thing of the past: they store the data that occurred previously, and as data is occurring, we're going to be processing it, analyzing it, acting on it. We only ever ended up with historical databases because we were limited by the technology that was available to us. Data doesn't occur in batches, but we processed it in batches because that was the best we could do — and it wasn't bad, and we've continued to improve. But streaming data today is still the exception, not the rule. There are projects within organizations that deal with streaming data, but it's not the default way in which we deal with data yet. And so that's my prediction: this is going to change. We're going to have streaming data be the default way in which we deal with data — and how you label it, what you call it, maybe these databases and data platforms just evolve to be able to handle it — but we're going to deal with data in a different way. And our research shows that already about half of the participants in our analytics and data benchmark research are using streaming data, and another third are planning to use streaming technologies, so that gets us to about eight out of ten organizations needing to use this technology. That doesn't mean they have to use it throughout the whole organization, but it's pretty widespread in its use today, and it has continued to grow. If you think about the consumerization of IT, we've all been conditioned to expect immediate access to information, immediate responsiveness. We want to know if an item is on the shelf at our local retail store, and we can go in and pick it up right now. That's the world we live in, and that's spilling over into the enterprise IT world, where we have to provide those same types of capabilities. So that's my prediction: historical databases become a thing of the past; streaming data becomes the default way in which we operate with data.
>> All right, thank you, David. Well, so what say you, Carl, a guy who's followed historical databases for a long time?
>> Well, one thing, actually — every database is historical, because as soon as you put data in it, it's now history; it no longer reflects the present state of things. But even if that history is only a millisecond old, it's still history. I would say — I mean, I know you're trying to be a little bit provocative in saying this, Dave, because you know as well as I do that people still need to do their taxes, they still need to do accounting, they still need to run general ledger programs and things like that, and that all involves historical data. That's not going to go away, unless you want to go to jail, so you're going to have to deal with that. But as far as the leading-edge functionality, I'm totally with you on that, and I'm just kind of wondering whether this requires a change in the way we perceive applications, in order to truly be manifested — rethinking the way applications work, saying that an application should respond instantly as soon as the state of things changes. What do you say about that?
>> I think that's true. I think we do have to think about things differently; that's not the way we designed systems in the past. We're seeing more and more systems designed that way, but again, it's not the default. And I agree 100 percent with you that we do need historical databases — that's clear — and even some of those historical databases will be used in conjunction with the streaming data, right? So, absolutely, let's take the data warehouse example, where you're using the data warehouse as context and the streaming data as the present. You're saying, here's a sequence of things that's happening right now: have we seen that sequence before, what does that pattern look like in past situations, and can we learn from that?
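[Editor's note: a minimal sketch of the pattern Dave and Carl describe — score each event as it arrives against historical context, then let it roll off into the historical store. The event fields, the 5x threshold and the in-memory "history" stand-in are illustrative assumptions, not any vendor's implementation.]

```python
# Streaming-as-default sketch: evaluate each event against history as it occurs,
# then append it to the historical store afterwards. Illustrative only.
from collections import defaultdict

history = defaultdict(list)          # stand-in for the historical database

def historical_average(customer):
    amounts = history[customer]
    return sum(amounts) / len(amounts) if amounts else 0.0

def on_event(event):
    """Called for every event in the stream, as it occurs."""
    customer, amount = event["customer"], event["amount"]
    baseline = historical_average(customer)      # the data warehouse as context
    if baseline and amount > 5 * baseline:       # "have we seen this sequence before?"
        print(f"alert: {customer} spent {amount}, ~{baseline:.0f} is typical")
    history[customer].append(amount)             # the event rolls off into history

for e in [{"customer": "c1", "amount": 40},
          {"customer": "c1", "amount": 45},
          {"customer": "c1", "amount": 400}]:    # anomalous relative to this customer's history
    on_event(e)
```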
>> So, Tony Baer, I wonder if you could comment. I mean, when you think about real-time inferencing at the edge, for instance, which is something a lot of people talk about, a lot of what we're discussing here in this segment looks like it's got great potential. What are your thoughts?
>> Yeah, well, I think you nailed it — you hit it right on the head there. A key thing I'm seeing — and I'm basically going to split this one down the middle — is that I don't see streaming as the default; what I see is streaming and transaction databases and analytics data — data warehouses, data lakes, whatever — converging. And what allows us technically to converge is cloud-native architecture, where you can basically distribute things. So you could have a node here that's doing the real-time processing — and this is where your question leads in — maybe doing some of that real-time predictive analytics, to take a look at, well, we're looking at this customer journey: what's happening with what the customer is doing right now, and how is that correlated with what other customers are doing? So the thing is, in the cloud you can partition this, and because of the speed of the infrastructure you can bring these together and orchestrate them in a loosely coupled manner. The other part is that the use cases are demanding it — and this is the part that goes back to what Dave is saying — when you look at customer 360, when you look at, let's say, smart utility grids, when you look at any type of operational problem, it has a real-time component and it has a historical component, and it has predictives. So my sense is that technically we can bring this together through the cloud, and the use case is that we can apply some real-time predictive analytics on these streams and feed that into the transactions, so that when we make a decision as a result of a transaction, we have this real-time input.
>> Sanjeev, did you have a comment?
>> Yeah, I was just going to say that, to this point, we have to think of streaming very differently, because in the historical databases we used to bring the data in and store it, and then we used to run rules on top — aggregations and all. But in the case of streaming, the mindset changes: the rules — the inference, all of that — are normally fixed, but the data is constantly changing. So it's a completely reverse way of thinking about, and building, applications on top of that.
>> So, Dave Menninger, there seemed to be some disagreement about the default there. What kind of time frame are you thinking about? Is it the end of the decade when it becomes the default? Where would you pin it?
>> I think around, you know, between five to ten years, I think this becomes the reality. It'll be more and more common between now and then, but then it becomes the default. And I also want, Sanjeev, at some point — maybe in one of our subsequent conversations — for us to talk about governing streaming data, because that's a whole other set of challenges.
>> We've also talked about it in just two dimensions, historical and streaming, but there's lots of low-latency, micro-batch, sub-second work that's not quite streaming, and in many cases it's fast enough; we're seeing a lot of adoption of near real time — not quite real time — as good enough for many applications, because nobody's really talking about the hardware dimension of this.
>> That'll just happen, Carl. So, near real time — maybe before you lose the customer, however you define that, right? Okay, let's move on to Brad. Brad, you want to talk about automation and AI — the pipeline — people feel like, hey, we can just automate everything. What's your prediction?
>> Yeah, I'm an AI aficionado, so apologies in advance for that, but I think we've been seeing automation at play within AI for some time now, and it's helped us do a lot of things, especially for practitioners that are building AI outcomes in the enterprise. It's helped them to fill skills gaps, it's helped them to speed development, and it's helped them to actually make AI better, because it in some ways provides some swim lanes — for example, with technologies like AutoML that can auto-document and create that sort of transparency we talked about a little bit earlier. But I think there's an interesting kind of convergence happening with this idea of automation, and that is that the automation that started happening for practitioners is trying to move outside of the traditional bounds of things like: I'm just trying to get my features, I'm just trying to pick the right algorithm, I'm just trying to build the right model. It's expanding across the full life cycle of building an AI outcome, starting at the very beginning with data and continuing on to the end, which is continuous delivery and continuous automation of that outcome, to make sure it's right and hasn't drifted, and so on. And because it's become kind of powerful, we're starting to see this weird thing happen where the practitioners are starting to converge with the users. That is to say, okay, if I'm in Tableau right now, I can stand up Salesforce Einstein Discovery and it will automatically create a nice predictive algorithm for me, given the data that I pull in. But what's starting to happen — and we're seeing this from the companies that create business software, so Salesforce, Oracle, SAP and others — is that they're starting to actually use these same ideas, and a lot of deep learning, to basically stand up these out-of-the-box, flip-a-switch AI outcomes at the ready for business users. And I very much think that's the way it's going to go, and what it means is that AI is slowly disappearing. I don't think that's a bad thing. I think, if anything, what we're going to see in 2022, and maybe into 2023, is this sort of rush to put this idea of disappearing AI into practice and have as many of these solutions in the enterprise as possible. You can see, for example, SAP is going to roll out this quarter this thing called adaptive recommendation services, which basically is a cold-start AI outcome that can work across a whole bunch of different vertical markets and use cases — it's just a recommendation engine for whatever you need it to do in the line of business. So basically you're an SAP user, you turn on your software one day — you're a sales professional, let's say — and suddenly you have a recommendation for customer churn. That's great. Well, I don't know — I think that's terrifying in some ways. I think it is the future, that AI is going to disappear like that, but I am absolutely terrified of it, because what it really does is call attention to a lot of the issues we already see around AI, specific to this idea of what we like to call, at Omdia, responsible AI: how do you build an AI outcome that is free of bias, that is inclusive, that is fair, that is safe, that is secure, that is auditable, etc., etc.? That takes a lot of work to do. And so if you imagine a customer that's just a Salesforce customer, let's say, and they're turning on Einstein Discovery within their sales software, you need some guidance to make sure that when you flip that switch, the outcome you're going to get is correct. That's going to take some work. So I think we're going to see this "let's roll this out," and suddenly there are going to be a lot of problems, a lot of pushback. Some of that's going to come from GDPR and the others that Sanjeev was mentioning earlier; a lot of it's going to come from internal CSR requirements within companies that are saying, hey, whoa, hold up, we can't do this all at once, let's take the slow route, let's make AI automated in a smart way. And that's going to take time.
>> Yeah, so a couple of predictions there that I heard. AI essentially — you said it disappears, it becomes invisible, maybe, if I can restate that. And then, if I understand it correctly, Brad, you're saying there's a backlash in the near term: people will say, oh, slow down, let's automate what we can. Those attributes that you talked about are non-trivial to achieve — is that why you're a bit of a skeptic?
>> Yeah, I think that we don't have any sort of standards that companies can look to and understand, and certainly within these companies — especially those that haven't already stood up an internal data science team — they don't have the knowledge to understand, when they flip that switch for an automated AI outcome, that it's going to do what they think it's going to do. So we need some sort of standard methodology and practice, best practices, that every company that's going to consume this invisible AI can make use of. And one of the things that Google kicked off a few years back, that's picking up some momentum and that the companies I just mentioned are starting to use, is this idea of model cards, where at least you have some transparency about what these things are doing. So for the SAP example, we know, for instance, that it's a convolutional neural network with a long short-term memory model that it's using; we know that it only works on Roman English; and therefore, me as a consumer, I can say, oh well, I know that I need to do this internationally, so I should not just turn this on today.
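[Editor's note: a hedged sketch of what a minimal, machine-readable model card might look like, following the idea Brad attributes to Google. The field names, values and pre-flight check below are invented for illustration and do not reflect SAP's, Salesforce's or Google's actual schemas.]

```python
# A minimal, hypothetical "model card": machine-readable facts about a model
# that a business user can inspect before flipping the switch. Illustrative only.
model_card = {
    "model": "churn_recommender_v1",
    "architecture": "CNN + LSTM",                   # what kind of model it is
    "training_data": "2019-2021 CRM events, English text only",
    "intended_use": "rank accounts by churn risk for sales follow-up",
    "known_limitations": [
        "not validated on non-English text",
        "no fairness audit across customer segments yet",
    ],
    "requires_human_review": True,                   # Carl's point below: a person authorizes the action
}

def fit_for(use_case, languages):
    """Cheap pre-flight check before enabling the automated outcome."""
    if "English" in model_card["training_data"] and any(lang != "en" for lang in languages):
        return False, "model was trained on English-only data"
    return True, "no blocking limitation found for: " + use_case

print(fit_for("international churn scoring", ["en", "de"]))
```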
>> Great, thank you. Carl, can you add anything, any context here?
>> Yeah, we've talked about some of the things Brad mentioned here at IDC, in our Future of Intelligence group, regarding in particular the moral and legal implications of having a fully automated, AI-driven system. Because we already know, and we've seen, that AI systems are biased by the data that they get — if they get data that pushes them in a certain direction. I think there was a story last week about an HR system that was recommending promotions for white people over black people, because in the past white people were promoted and rated more productive than black people — but it had no context as to why, which is that black people were being historically discriminated against. The system doesn't know that. So you have to be aware of that, and I think that, at the very least, there should be controls when a decision has either a moral or a legal implication — when you really need a human judgment. It could lay out the options for you, but a person actually needs to authorize that action. And I also think that we will always have to be vigilant regarding the kind of data we use to train our systems, to make sure it doesn't introduce unintended biases — and to some extent it always will, so we'll always be chasing after them.
>> That's absolutely right, Carl. Yeah, I think what you have to bear in mind as a consumer of AI is that it is a reflection of us, and we are a very flawed species. If you look at all the really fantastic, magical-looking super models we see, like GPT-3, and the 4 that's coming out, they're xenophobic and hateful, because the data they're built upon, and the algorithms, and the people that build them, are us. So AI is a reflection of us. We need to keep that in mind.
>> Yeah — the AI is biased because humans are biased. All right, great. Okay, let's move on. Doug Henschen — you know, a lot of people said that data lake, that term, was not going to live on, but it appears to have some legs here. You want to talk about lakehouse? Bring it on.
>> Yes, I do. My prediction is that lakehouse, and this idea of a combined data warehouse and data lake platform, is going to emerge as the dominant data management offering. I say offering — that doesn't mean it's going to be the dominant thing that organizations have out there, but it's going to be the predominant vendor offering in 2022. Now, heading into 2021 we already had Cloudera, Databricks, Microsoft and Snowflake as proponents; in 2021, SAP, Oracle and several of these fabric, virtualization and mesh vendors joined the bandwagon. The promise is that you have one platform that manages your structured, unstructured and semi-structured information, and it addresses both the BI and analytics needs and the data science needs. The real promise there is simplicity and lower cost. But I think end users have to answer a few questions. The first is: does your organization really have a center of data gravity, or is the data highly distributed — multiple data warehouses, multiple data lakes, on premises, cloud? If it's very distributed and you have difficulty consolidating, and that's not really a goal for you, then maybe that single platform is unrealistic and not likely to add value to you. The fabric and virtualization vendors, the mesh idea — that's where, if you have this highly distributed situation, that might be a better path forward. The second question, if you are looking at one of these lakehouse offerings — if you are looking at consolidating, simplifying, bringing things together onto a single platform — is that you have to make sure it meets both the warehouse need and the data lake need. So you have vendors like Databricks and Microsoft with Azure Synapse, really new to the data warehouse space, and they're having to prove that the data warehouse capabilities on their platforms can meet the scaling requirements, can meet the user and query concurrency requirements, meet those tight SLAs. And then on the other hand you have the Oracle, SAP, Snowflake — the data warehouse folks — coming into the data science world, and they have to prove that they can manage the unstructured information and meet the needs of the data scientists. I'm seeing a lot of the lakehouse offerings from the warehouse crowd managing that unstructured information in columns and rows, and some of these vendors — Snowflake in particular — are really relying on partners for the data science needs. So you really have to look at a lakehouse offering and make sure it meets both the warehouse and the data lake requirement.
>> Well, thank you, Doug. Well, Tony, if those two worlds are going to come together, as Doug was saying — the analytics and the data science worlds — does there need to be some kind of semantic layer in between? I don't know; weigh in on this topic, if you would.
>> Oh, didn't we talk about data fabrics before — a common metadata layer? Actually, I'm almost tempted to say let's declare victory and go home, in that this has actually been going on for a while. I actually agree with much of what Doug is saying there. I remember as far back as, I think it was, 2014, I was doing a study — I was still at Ovum, the predecessor of Omdia — looking at all these specialized databases that were coming up and seeing that there's overlap at the edges, but yet there was still going to be a reason, at the time, that you would have, let's say, a document database for JSON, you'd have a relational database for transactions and for the data warehouse, and you had, at that time, something that resembles what we'd now consider a data lake. And the thing is, what I was saying at the time is that you're seeing a blending at the edges — that was about five or six years ago — and the lakehouse is essentially the current manifestation of that idea. There is a dichotomy, in terms of the old argument: do we centralize this all in a single place, or do we virtualize? And I think it's always going to be a yin and yang; there's never going to be a single silver bullet. I do see that there are also going to be questions — and these are points that Doug raised — about what you need in terms of performance characteristics: do you need, for instance, high concurrency, do you need the ability to do some very sophisticated joins, or is your requirement more to be able to distribute your processing as far as possible, to essentially do a kind of brute-force approach? All these approaches are valid based on the use case. I just see that the lakehouse — it's a relatively new term, introduced by Databricks a couple of years ago — is the culmination of what's been a long-time trend. And what we see in the cloud is that data warehouses are starting to treat it as a checkbox item to say, hey, we can source data in cloud storage — S3, Azure Blob storage, whatever — as long as it's in certain formats, like Parquet or CSV or something like that. I see that becoming kind of a checkbox item. So to that extent, I think the lakehouse, depending on how you define it, is already reality — in some cases maybe new terminology, but not a whole heck of a lot new under the sun.
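[Editor's note: a small hedged sketch of the "open formats in object storage" checkbox Tony describes — write a Parquet file once, then read it back with any engine that understands the format. The bucket path is hypothetical; writing Parquet with pandas assumes pyarrow or fastparquet is installed, and reading an s3:// URI additionally assumes the optional s3fs dependency.]

```python
# Open-format data lake sketch: one Parquet file, written once, readable by many engines.
# Requires pandas plus a Parquet engine (pyarrow or fastparquet). Illustrative only.
import pandas as pd

df = pd.DataFrame({
    "customer": ["c1", "c2", "c3"],
    "amount": [40.0, 12.5, 400.0],
    "ts": pd.to_datetime(["2022-01-03", "2022-01-03", "2022-01-04"]),
})

df.to_parquet("events.parquet", index=False)   # local file, works as-is
events = pd.read_parquet("events.parquet")
print(events.dtypes)

# The cloud variant is the same call with an object-store URI (hypothetical bucket),
# which is the "checkbox" warehouses and lake engines alike can point at:
# df.to_parquet("s3://example-bucket/events/2022-01-04.parquet", index=False)
# pd.read_parquet("s3://example-bucket/events/2022-01-04.parquet")
```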
>> Yeah, and Dave Menninger — thank you, Tony — a lot of this is going to come down to vendor marketing, right? Some people try to co-opt the term; we talked about data mesh washing. What are your thoughts on this?
>> Yeah, so I used the term data platform earlier, and part of the reason I use that term is that it's more vendor-neutral. We've tried to sort of stay out of the vendor terminology-patenting world. Whether the term lakehouse is what sticks or not, the concept is certainly going to stick, and we have some data to back it up. About a quarter of organizations that are using data lakes today already incorporate data warehouse functionality into it, so they consider their data lakehouse and data warehouse one and the same. About a quarter of organizations — a little less, but about a quarter — feed the data lake from the data warehouse, and about a quarter of organizations feed the data warehouse from the data lake. So it's pretty obvious that three quarters of organizations need to bring this stuff together. The need is there, the need is apparent, and the technology is going to continue to converge. I like to talk about it this way: you've got data lake people over here at one end — and I'm not going to talk about why people thought data lakes were a bad idea, because they thought you just throw stuff in a server and ignore it, right? That's not what a data lake is — so you've got data lake people over here, and you've got database and data warehouse people over here. Database vendors are adding data lake capabilities, and data lake vendors are adding data warehouse capabilities, so it's obvious that they're going to meet in the middle. I mean, like Tony says, I think we should declare victory and go home.
>> So, just to follow up on that: are you saying the specialized lake and the specialized warehouse go away? I mean, Tony, data mesh practitioners — or advocates — would say, well, they could all live as just a node on the mesh. But based on what Dave just said, are we going to see those all morph together?
>> Well, number one, as I was saying before, there's always going to be this sort of centrifugal force, this tug of war, between: do we centralize the data, or do we virtualize? And the fact is, I don't think there's ever going to be any single answer. In terms of data mesh, data mesh has nothing to do with how you physically implement the data. You could have a data mesh on a data warehouse; it's just that the difference is that, even if we use the same physical data store, everybody is logically governing it differently. A data mesh is not a technology, it's a process — a governance process. So essentially, as I was saying before, this is the culmination of a long-time trend. We're seeing a lot of blurring, but there are going to be cases where, for instance, if I need, let's say, high concurrency or something like that, there are certain things I'm not going to be able to get efficiently out of a data lake, where I'm basically doing really brute-force, very fast file scanning and that type of thing. So I think there will always be some delineations, but I would agree with Dave and with Doug that we are seeing a confluence of requirements: we need, essentially, the abilities of a data lake and a data warehouse, and these need to come together.
>> So I think what we're likely to see is organizations look for a converged platform that can handle both sides for their center of data gravity. The mesh and the fabric vendors — the fabric and virtualization vendors — are all on board with the idea of this converged platform, and they're saying, hey, we'll handle all the edge cases, the stuff that isn't in that center of data gravity, that's distributed off in a cloud or at a remote location. So you can have that single platform for the center of your data, and then bring in virtualization, mesh, what have you, for reaching out to the distributed data.
>> Bingo. As they basically said, people are happy when they virtualize data — I think, yes, at this point. But to Dave Menninger's point, they are converging: Snowflake has introduced support for unstructured data, so now we are literally splitting hairs here. Now, what Databricks is saying is, aha, but it's easier to go from data lake to data warehouse than it is from data warehouse to data lake. So I think we're getting into semantics, but we've already seen these two converge.
>> So it takes something like AWS, who's got, what, 15 data stores — are they going to have 15 converged data stores? That's going to be interesting to watch. All right, guys, I'm going to go down the list and do, like, one word each, and if you guys, each of the analysts, would just add a very brief sort of course correction for me. So, Sanjeev — I mean, governance, maybe it's the dog that wags the tail now. It's coming to the fore, all this ransomware stuff — we really didn't talk much about security — but what's the one word in your prediction that you would leave us with on governance?
>> It's going to be mainstream.
>> Mainstream, okay. Tony Baer: mesh washing is what I wrote down — that's what we're going to see in 2022 — a little reality check. You want to add to that?
>> The reality check is: I hope that no vendor jumps the shark and calls their offering a data mesh project.
>> Yeah, yeah, let's hope that doesn't happen — if they do, we're going to call them out. Carl — I mean, graph databases, thank you for sharing some high-growth metrics; I know it's early days, but magic is what I took away from that. It's the magic database.
>> Yeah, I would actually — I've said this to people too — I kind of look at it as the Swiss Army knife of data, because you can pretty much do anything you want with it. That doesn't mean you should. I mean, that's definitely the case: if you're managing things that are in a fixed schematic relationship, probably a relational database is a better choice, and there are times when a document database is a better choice. It can handle those things, but it may not be the best choice for that use case. But for a great many — especially the new, emerging use cases I listed — it's the best choice.
>> Thank you. And Dave Menninger — thank you, by the way, for bringing the data in; I like how you supported all your comments with some data points. But streaming data becomes the sort of default paradigm, if you will. What would you add?
>> Yeah, I would say think fast, right? That's the world we live in — you've got to think fast.
>> Fast, love it. And Brad Shimmin — I love it. On the one hand I was saying, okay, great, I'm afraid I might get disrupted by one of these internet giants who are AI experts, so I'm going to be able to buy instead of build AI. But then again, I've got some real issues; there's a potential backlash there. So give us your bumper sticker.
>> Yeah, I would say, going with Dave, think fast — and also think slow, to talk about the book that everyone talks about. I would say really that this is all about trust: trust in the idea of automation and of a transparent, invisible AI across the enterprise — but verify. Verify before you do anything.
>> And then Doug Henschen — look, I think the trend is your friend here on this prediction, with lakehouse really becoming dominant. I liked the way you set up that notion of the data warehouse folks coming at it from the analytics perspective, and then you've got the data science worlds coming together. I still feel as though there's this piece in the middle that we're missing, but your final thoughts — we'll give you the last word.
>> Well, I think the idea of consolidation and simplification always prevails. That's why the appeal of a single platform is going to be there; we've already seen that with Hadoop platforms moving toward cloud, moving toward object storage, and object storage becoming really the common storage point, whether it's a lake or a warehouse. And a second point: I think ESG mandates are going to come in alongside GDPR and things like that to up the ante for good governance.
>> Yeah, thank you for calling that out. Okay, folks, hey, that's all the time we have here. Your experience and depth of understanding on these key issues in data and data management were really on point, and they were on display today. I want to thank you for your contributions; really appreciate your time.
>> Enjoyed it.
>> Thank you.
>> Now, in addition to this video, we're going to be making available transcripts of the discussion. We're going to do clips of this as well and put them out on social media. I'll write this up and publish the discussion on wikibon.com and siliconangle.com. No doubt several of the analysts on the panel will take the opportunity to publish written content, social commentary, or both. I want to thank the power panelists, and thanks for watching this special CUBE presentation. This is Dave Vellante. Be well, and we'll see you next time. [Music]
SUMMARY :
the end of the day need to speak you
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
381 databases | QUANTITY | 0.99+ |
2014 | DATE | 0.99+ |
2022 | DATE | 0.99+ |
2021 | DATE | 0.99+ |
january of 2022 | DATE | 0.99+ |
100 users | QUANTITY | 0.99+ |
jamal dagani | PERSON | 0.99+ |
last week | DATE | 0.99+ |
dave meninger | PERSON | 0.99+ |
sanji | PERSON | 0.99+ |
second question | QUANTITY | 0.99+ |
15 converged data stores | QUANTITY | 0.99+ |
dave vellante | PERSON | 0.99+ |
microsoft | ORGANIZATION | 0.99+ |
three | QUANTITY | 0.99+ |
sanjeev | PERSON | 0.99+ |
2023 | DATE | 0.99+ |
15 data stores | QUANTITY | 0.99+ |
siliconangle.com | OTHER | 0.99+ |
last year | DATE | 0.99+ |
sanjeev mohan | PERSON | 0.99+ |
six | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
carl | PERSON | 0.99+ |
tony | PERSON | 0.99+ |
carl olufsen | PERSON | 0.99+ |
six years | QUANTITY | 0.99+ |
david | PERSON | 0.99+ |
carlos specter | PERSON | 0.98+ |
both sides | QUANTITY | 0.98+ |
2010s | DATE | 0.98+ |
first backlash | QUANTITY | 0.98+ |
five years | QUANTITY | 0.98+ |
today | DATE | 0.98+ |
dave | PERSON | 0.98+ |
each | QUANTITY | 0.98+ |
three quarters | QUANTITY | 0.98+ |
first | QUANTITY | 0.98+ |
single platform | QUANTITY | 0.98+ |
lake house | ORGANIZATION | 0.98+ |
both | QUANTITY | 0.98+ |
this year | DATE | 0.98+ |
doug | PERSON | 0.97+ |
one word | QUANTITY | 0.97+ |
this year | DATE | 0.97+ |
wikibon.com | OTHER | 0.97+ |
one platform | QUANTITY | 0.97+ |
39 | QUANTITY | 0.97+ |
about 600 percent | QUANTITY | 0.97+ |
two analysts | QUANTITY | 0.97+ |
ten years | QUANTITY | 0.97+ |
single platform | QUANTITY | 0.96+ |
five | QUANTITY | 0.96+ |
one | QUANTITY | 0.96+ |
three quarters | QUANTITY | 0.96+ |
california | LOCATION | 0.96+ |
ORGANIZATION | 0.96+ | |
single | QUANTITY | 0.95+ |
Dipti Borkar, Ahana, and Derrick Harcey, Securonix | CUBE Conversation, July 2021
(upbeat music) >> Welcome to theCUBE Conversation. I'm John Furrier, host of theCUBE here in Palo Alto, California, in our studios. We've got a great conversation around open data lake analytics on AWS, two great companies, Ahana and Securonix. Dipti Borkar, Co-founder and Chief Product Officer at Ahana, is here. Great to see you, and Derrick Harcey, Chief Architect at Securonix. Thanks for coming on, really appreciate you guys spending the time. >> Yeah, thanks so much, John. Thank you for having us and Derrick, hello again. (laughing) >> Hello, Dipti. >> We had a great conversation around our startup showcase, which you guys were featured in last month this year, 2021. The conversation continues and a lot of people are interested in this idea of open systems, open source. Obviously open data lakes is really driving a lot of value, especially with machine learning and whatnot. So this is a key, key point. So can you guys just take a step back before we get under the hood and set the table on Securonix and Ahana? What's the big play here? What is the value proposition? >> Why sure, I'll give a quick update. Securonix has been in the security business, first user and entity behavioral analytics, and then the next generation SIEM platform, for 10 years now. And we really need to take advantage of some cutting edge technologies in the open source community and drive adoption and momentum so that we can not only bring in data from our customers, so that they can find security threats, but also store it in a way that they can use for other purposes within their organization. That's where the open data lake is very critical. >> Yeah, and to add on to that, John, what we've seen, you know, traditionally we've had data warehouses, right? We've had operational systems move all of their data into the warehouse and, you know, while these systems are really good, built for good use cases, the amount of data is exploding, the types of data are exploding, different types, semi-structured, structured. And so, as companies like Securonix in the security space, as well as other verticals, look for getting more insights out of their data, there's a new approach that's emerging where you have a data lake, which AWS has revolutionized with S3 and commoditized, and there's analytics that's built on top of it. And so we're seeing a lot of good advantages that come out of this new approach. >> Well, it's interesting, EC2 and S3 are having their 15th birthday, as they say, in Amazon's interesting teenage years, but while I got you guys here, I want to just ask you, can you define the SIEM thing, because the SIEM market is exploding, it just changed a little bit. Obviously it's data and event management, but again, as data becomes more proliferating, and it's not stopping anytime soon, as cloud native applications emerge, why is this important? What is this SIEM category? What's it about? >> Yeah, thanks. I'll take that. So obviously SIEM traditionally has been around for about a couple of decades and it really started with first log collection and management and rule-based threat detection. Now what we call next generation SIEM is really the modernization of a security platform that includes streaming threat detection and behavioral analysis and data analytics. We literally look for thousands of different threat detection techniques, and we chain together sequences of events and we stream everything in real time and it's very important to find threats as quickly as possible.
But the momentum that we see in the industry as we see massive sizes of customers, we have made a transition from on-premise to the cloud and we literally are processing tens of petabytes of data for our customers. And it's critical that we can adjust data quickly, find threats quickly and allow customers to have the tools to respond to those security incidents quickly and really get the handle on their security posture. >> Derrick, if I ask you what's different about this next gen SIEM, what would you say and what's the big a-ha? What's the moment there? What's the key thing? >> The real key is taking the off the boundaries of scale. We want to be able to ingest massive quantities of data. We want to be able to do instant threat detection, and we want to be able to search on the entire forensic data set across all of the history of our customer base. In the past, we had to make sacrifices, either on the amount of data we ingest or the amount of time that we stored that data. And the really the next generation SIEM platform is offering advanced capabilities on top of that data set because those boundaries are no longer barriers for us. >> Dipti, any comment before I jump into the question for you? >> Yeah, you know, absolutely. It is about scale and like I mentioned earlier, the amount of data is only increasing and it's also the types of information. So the systems that were built to process this information in the past are, you know, support maybe terabytes of data, right? And that's where new technologies open source engines like Presto come in, which were built to handle internet scale. Presto was kind of created at Facebook to handle these petabytes that Derrick is talking about that every industry is now seeing where we're are moving from gigs to terabytes to petabytes. And that's where the analytic stack is moving. >> That's a great segue. I want to ask you while I got you here 'cause this is again, the definitions, 'cause people love to hear the experts weigh in. What is open data lake analytics? How would you define that? And then talk about where Presto fits in. >> Yeah, that's a great question. So the way I define open data lake analytics is you have a data lake on the core, which is, let's say S3, it's the most popular one, but on top of it, there are open aspects, it is open format. Open formats play a very important role because you can have different types of processing. It could be SQL processing, it could be machine learning, it could be other types of workloads, all work on these open formats versus a proprietary format where it's locked and it's open interfaces. Open interfaces that are like SQL, JDBC, ODBC is widely accessible to a range of tools. And so it's everywhere. Open source is a very important part of it. As companies like Securonix pick these technologies for their mission critical systems, they want to know that this is going to be available and open for them for a long period of time. And that's why open source becomes important. And then finally, I would say open cloud because at the end of the day, you know, while AWS is where a lot of the innovations happening, a lot of the market is, there are other clouds and open cloud is something that these engines were built for, right? So that's how I define open data lake analytics. It's analytics with query engines built on top of these open formats, open source, open interfaces and open cloud. Now Presto comes in where you want to find the needle in the haystack, right? 
And so when you have these deep questions about where did the threat come from or who was it, right? You have to ask these questions of your data. And Presto is an open source distributed SQL engine that allows data platform teams to run queries on their data lakes in a high-performance ways, in memory and on these petabytes of data. So that's where Presto fits in. It's one of the defacto query engines for SQL analysis on the data lake. So hopefully that answers the question, gives more context. >> Yeah, I mean, the joke about data lakes has been you don't want to be a data swamp, right? That's what people don't want. >> That's right. >> But at the same time, the needle in the haystack, it's like big data is like a needle in a haystack of needles. So there's a constant struggle to getting that data, the right data at the right time. And what I learned in the last presentation, you guys both presented, your teams presented at the conference was the managed service approach. Could you guys talk about why that approach works well together with you guys? Because I think when people get to the cloud, they replatform, then they start refactoring and data becomes a real big part of that. Why is the managed service the best approach to solving these problems? >> Yeah and interestingly, both Securonix and Ahana have a managed service approach so maybe Derrick can go first and I can go after. >> Yeah, yeah. I'll be happy to go first. You know, we really have found making the transition over the last decade from off premise to the cloud for the majority of our customers that running a large open data lake requires a lot of different skillsets and there's hundreds of technologies in the open source community to choose from and to be able to choose the right blend of skillsets and technologies to produce a comprehensive service is something that customers can do, many customers did do, and it takes a lot of resources and effort. So what we really want to be able to do is take and package up our security service, our next generation SIEM platform to our customers where they don't need to become experts in every aspect of it. Now, an underlying component of that for us is how we store data in an open standards way and how we access that data in an open standards way. So just like we want our customers to get immediate value from the security services that we provide, we also want to be able take advantage of a search service that is offered to us and supported by a vendor like Ahana where we can very quickly take advantage of that value within our core underlying platform. So we really want to be able to make a frictionless effort to allow our customers achieve value as quick as possible. >> That's great stuff. And on the Ahana side, open data lakes, really the ease of use there, it sounds easy to me, but we know it's not easy just to put data in a data lake. At the end of the day, a lot of customers want simplicity 'cause they don't have the staffing. This comes up a lot. How do you leverage their open source participation and/or getting stood up quickly so they can get some value? Because that seems to be the number one thing people want right now. Dipti, how does that work? How do people get value quickly? >> Yeah, absolutely. When you talk about these open source press engines like Presto and others, right? They came out of these large internet companies that have a lot of distributed systems, engineers, PhDs, very kind of advanced level teams. 
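For readers who want to see what the "needle in the haystack" queries described above look like in practice, here is a minimal, illustrative sketch of an ad hoc Presto query run from Python. The cluster endpoint, catalog, schema, and the `login_events` table are hypothetical placeholders, not Securonix's actual deployment, and the open source presto-python-client is just one common way to connect.

```python
# A minimal sketch (not Securonix's actual schema): running an ad hoc
# "needle in the haystack" query against a Presto cluster from Python.
# Table and column names are hypothetical placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # assumed cluster endpoint
    port=8080,
    user="analyst",
    catalog="hive",      # catalog backed by an S3 data lake
    schema="security",   # hypothetical schema of enriched events
)

cur = conn.cursor()
cur.execute("""
    SELECT user_name, src_ip, count(*) AS failures
    FROM login_events
    WHERE event_time > current_timestamp - INTERVAL '7' DAY
      AND outcome = 'FAILURE'
    GROUP BY user_name, src_ip
    HAVING count(*) > 50
    ORDER BY failures DESC
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)
```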
And they can manage these distributed systems, building onto them, adding features at large scale, but not every company can, and these engines are extremely powerful. So when you combine the power of Presto with the cloud and a managed service, that's where value for everyone comes in. And that's what I did with Ahana: looked at Presto, which is a great engine, but converted it into a great user experience, so that whether it's a three person platform team or a five person platform team, they still get the same benefit of Presto that a Facebook gets, but at much, much less operational complexity and cost, as well as the ability to depend on a vendor who can then drive the innovation and make it even better. And so that's where managed services really come in. There's thousands of configuration parameters that need to be tuned. With Ahana, you get it out of the box. So you have the best practices that are followed at these larger companies. Our team comes from Facebook, Uber and others, and you get that out of the box; with a few clicks you can get up and running. And so you see value immediately, in 30 minutes you're up and running and you can create your data lake, versus with Hadoop and these prior systems, it would take months to receive real value from some of these systems. >> Yeah, we saw the Hadoop scar tissue. It's all great and all good now, but it takes too much resource, standing up clusters, managing it, you can't hire enough people. I got to ask you while you're on that topic, do you guys ship templates? How do you solve the problem of out of the box? You mentioned some out of the box capability. Do you guys think of these as recipes, templates? What's your thoughts around what you're providing customers to get up and running? >> Yeah, so in the case of Securonix, right, let's say they want to create a Presto cluster. They go into our SaaS console. You essentially put in the number of nodes that you want, the number of workers you want. There's a lot of additional value that we built in, like caching capabilities if you want more performance, built-in cataloging, that's again another single click. And there isn't really as much of a template. Everybody gets the best tuned Presto for their workloads. Now there are certain workloads where you might have interactive in some cases, or you might have transformation, batch ETL, and what we're doing next is actually giving you the knobs so that it comes pre-tuned for the type of workload that you want to run, versus you figuring it out. And so that's what I mean by out of the box, where you don't have to worry about these configuration parameters. You get the performance. And maybe Derrick, can you talk a little bit about the benefits of the managed service and the usage as well. >> Yeah, absolutely. So, I'll answer the same question and then I'll tie back to what Dipti asked. Really, you know, our customers, we want it to be very easy for them to ingest security event logs. And there's really hundreds of types of security event logs that we support natively out of the box, but the key for us is a standard that we call the open event format. And that is a normalized schema. We take any data source in its normalized format, be it a collector device a customer uses on-premise, they send the data up to our cloud, we do streaming analysis and data analytics to determine where the threats are. And once we do that, then we send the data off to a long-term storage format in a standards-based Parquet file. And that Parquet file is natively read by the Ahana service.
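As a rough illustration of the pipeline Derrick describes, the sketch below writes a small batch of normalized events to S3 as Parquet so that a SQL engine such as Presto can read them. The field names are invented stand-ins rather than the actual open event format, and pyarrow with s3fs is simply one common way to produce Parquet from Python.

```python
# Illustrative only: landing a batch of normalized events as Parquet on S3
# so that a SQL engine such as Presto can read them. The field names stand
# in for whatever a normalized event schema defines; bucket and prefix are
# hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

events = {
    "event_time": ["2021-07-01T12:00:00Z", "2021-07-01T12:00:01Z"],
    "user_name": ["alice", "bob"],
    "src_ip": ["10.0.0.5", "10.0.0.9"],
    "action": ["login", "file_read"],
    "outcome": ["SUCCESS", "FAILURE"],
}
table = pa.Table.from_pydict(events)

fs = s3fs.S3FileSystem()  # credentials come from the environment
pq.write_to_dataset(
    table,
    root_path="my-security-lake/enriched/dt=2021-07-01",  # hypothetical bucket/prefix
    filesystem=fs,
)
```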
So we simply deploy an Ahana cluster that uses the Presto engine that natively supports our open standard file format. And we have a normalized schema that our application can immediately start to see value from. So we handle the collection and streaming ingest, and we simply leverage the engine in Ahana to give us the appropriate scale. We can size up and down and control the cost to give the users the experience that they're paying for. >> I really love this topic because one, not only is it cutting edge, but it's very relevant for modern applications. You mentioned next gen SIEMs, SIEM, security information event management, not SIM as memory card, which I think of all the time because I always want to add more, but this brings up the idea of streaming data real-time, but as more services go to the cloud, Derrick, if you don't mind sharing more on this. Share the journey that you guys gone through, because I think a lot of people are looking at the cloud and saying, and I've been in a lot of these conversations about repatriation versus cloud. People aren't going that way. They're going more innovation with his net new revenue models emerging from the value that they're getting out of understanding events that are happening within the network and the apps, even when they're being stood up and torn down. So there's a lot of cloud native action going on where just controlling and understanding is way beyond the, just put stuff into an event log. It's a whole nother animal. >> Well, there's a couple of paradigm shifts that we've seen major patterns for in the last five or six years. Like I said, we started with the safe streaming ingest platform on premise. We use some different open source technologies. What we've done when we moved to the cloud is we've adopted cloud native services as part of our underlying platform to modernize and make our service cloud native. But what we're seeing as many customers either want to focus on on-premise deployments and especially financial institutions and government institute things, because they are very risk averse. Now we're seeing even those customers are realizing that it's very difficult to maintain the hundreds or thousands of servers that it requires on premise and have the large skilled staff required to keep it running. So what we're seeing now is a lot of those customers deployed some packaged products like our own, and even our own customers are doing a mass migration to the cloud because everything is handled for them as a service. And we have a team of experts that we maintain to support all of our global customers, rather than every one of our global customers having their own teams that we then support on the back end. So it's a much more efficient model. And then the other major approach that many of our customers also went down the path of is, is building their own security data lake. And many customers were somewhat successful in building their own security data lake but in order to keep up with the innovation, if you look at the analyst groups, the Gartner Magic Quadrant on the SIEM space, the feature set that is provided by a packaged product is a very large feature set. And even if somebody was put together all of the open source technologies to meet 20% of those features, just maintaining that over time is very expensive and very difficult. So we want to provide a service that has all of the best in class features, but also leverages the ability to innovate on the backend without the customer knowing. 
So we can do a technology shift to Ahana and Presto from our previous technology set. The customer doesn't know the difference, but they see the value add within the service that we're offering. >> So if I get this right, Derrick, Presto's enabling you guys to do threat detection at a level that you're super happy with as well as giving you the option for give self-service. Is that right for the, is that a kind of a- >> Well, let me clarify our definition. So we do streaming threat detection. So we do a machine learning based behavioral analysis and threat detection on rule-based correlation as well. So we do threat detection during the streaming process, but as part of the process of managing cybersecurity, the customer has a team of security analysts that do threat hunting. And the threat hunting is where Ahana comes in. So a human gets involved and starts searches for the forensic logs to determine what happened over time that might be suspicious and they start to investigate through a series of queries to give them the information that's relevant. And once they find information that's relevant, then they package it up into an algorithm that will do a analysis on an ongoing basis as part of the stream processing. So it's really part of the life cycle of hunting a real time threat detection. >> It's kind of like old adage hunters and farmers, you're farming through the streaming and hunting with the detection. I got to ask you, what would it be the alternative if you go back, I mean, I know cloud's so great because you have cutting edge applications and technologies. Without Presto, where would you be? I mean, what would be life like without these capabilities? What would have to happen? >> Well, the issue is not that we had the same feature set before we moved to Presto, but the challenge was on scale. The cost profile to continue to grow from 100 terabytes to one petabyte, to tens of petabytes, not only was it expensive, but it just, the scaling factors were not linear. So not only did we have a problem with the costs, but we also had a problem with the performance tailing off and keeping the service running. A large Hadoop cluster, for example, our first incarnation of this use, the hive service, in order to query data in a MapReduce cluster. So it's a completely different technology that uses a distributed Hadoop compute cluster to do the query. It does work, but then we start to see resource contention with that, and all the other things in the Hadoop platform. The Presto engine has the beauty of it, not only was it designed for scale, but it's feature built just for a query engine and that's the providing the right tool for the job, as opposed to a general purpose tool. >> Derrick, you've got a very busy job as chief architect. What are you excited about going forward when you look at the cloud technologies? What are you looking at? What are you watching? What are you getting excited about or what worries you? >> Well, that's a good question. What we're really doing, I'm leading up a group called the Securonix Innovation Labs, and we're looking at next generation technologies. We go through and analyze both open source technologies, technologies that are proprietary as well as building own technologies. 
And that's where we came across Ahana as part of a comprehensive analysis of different search engines, because we wanted to go through another round of search engine modernization, and we worked together in a partnership, and we're going to market together as part of our modernization efforts that we're continuously going through. So I'm looking forward to iterative continuous improvement over time. And this next journey, what we're seeing because of the growth in cybersecurity, really requires new and innovative technologies to work together holistically. >> Dipti, you got a great company that you co-founded. I got to ask you as the co-founder and chief product officer, you both the lead entrepreneur also, got the keys to the kingdom with the products. You got to balance that 20 miles stare out in the future while driving product excellence. You've got open source as a tailwind. What's on your mind as you go forward with your venture? >> Yeah. Great question. It's been super exciting to have found the Ahana in this space, cloud data and open source. That's where the action is happening these days, but there's two parts to it. One is making our customers successful and continuously delivering capabilities, features, continuing on our ease of use theme and a foundation to get customers like Securonix and others to get most value out of their data and as fast as possible, right? So that's a continuum. In terms of the longer term innovation, the way I see the space, there is a lot more innovation to be done and Presto itself can be made even better and there's a next gen Presto that we're working on. And given that Presto is a part of the foundation, the Linux Foundation, a lot of this innovation is happening together collaboratively with Facebook, with Uber who are members of the foundation with us. Securonix, we look forward to making a part of that foundation. And that innovation together can then benefit the entire community as well as the customer base. This includes better performance with more capabilities built in, caching and many other different types of database innovations, as well as scaling, auto scaling and keeping up with this ease of use theme that we're building on. So very exciting to work together with all these companies, as well as Securonix who's been a fantastic partner. We work together, build features together, and I look at delivering those features and functionalities to be used by these analysts, data scientists and threat hunters as Derrick called them. >> Great success, great partnership. And I love the open innovation, open co-creation you guys are doing together and open data lakes, great concept, open data analytics as well. This is the future. Insights coming from the open and sharing and actually having some standards. I love this topic, so Dipti, thank you very much, and Derrick, thanks for coming on and sharing on this Cube Conversation. Thanks for coming on. >> Thank you so much, John. >> Thanks for having us. >> Thanks. Take care. Bye-bye. >> Okay, it's theCube Conversation here in Palo Alto, California. I'm John furrier, your host of theCube. Thanks for watching. (upbeat music)
SUMMARY :
guys spending the time. and Derrick, hello again. and set the table on Securonix and Ahana? and momentum that we can into the warehouse and those, you know, because the SIEM market is exploding, and really get the handle either on the amount of data we ingest and it's also the types of information. hear the experts weigh in. So hopefully that answers the Yeah, I mean, the joke Why is the managed Yeah and interestingly, a search service that is offered to us And on the Ahana side, open data lakes, and you get that out of the box, I got to ask you while and the usage as well. and control the cost from the value that they're getting and have the large skilled staff as well as giving you the for the forensic logs to and hunting with the detection. and that's the providing when you look at the cloud technologies? because of the growth in cybersecurity, got the keys to the and a foundation to get And I love the open here in Palo Alto, California.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Securonix | ORGANIZATION | 0.99+ |
John | PERSON | 0.99+ |
Derrick Harcey | PERSON | 0.99+ |
Derrick | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
Ahana | ORGANIZATION | 0.99+ |
Ahana | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
20% | QUANTITY | 0.99+ |
July 2021 | DATE | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
Dipti | PERSON | 0.99+ |
100 terabytes | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
10 years | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Linux Foundation | ORGANIZATION | 0.99+ |
two parts | QUANTITY | 0.99+ |
thousands | QUANTITY | 0.99+ |
Securonix Innovation Labs | ORGANIZATION | 0.99+ |
tens of petabytes | QUANTITY | 0.99+ |
30 minutes | QUANTITY | 0.99+ |
one petabyte | QUANTITY | 0.99+ |
Dipti Borkar | PERSON | 0.99+ |
20 miles | QUANTITY | 0.99+ |
Palo Alto, California | LOCATION | 0.99+ |
five person | QUANTITY | 0.99+ |
First | QUANTITY | 0.99+ |
SQL | TITLE | 0.99+ |
last month | DATE | 0.99+ |
both | QUANTITY | 0.99+ |
One | QUANTITY | 0.98+ |
15th birthday | QUANTITY | 0.97+ |
two great companies | QUANTITY | 0.96+ |
HuBERT | ORGANIZATION | 0.96+ |
Hadoop | TITLE | 0.96+ |
S3 | TITLE | 0.96+ |
hundreds of technologies | QUANTITY | 0.96+ |
three person | QUANTITY | 0.95+ |
Parquet | TITLE | 0.94+ |
first incarnation | QUANTITY | 0.94+ |
first | QUANTITY | 0.94+ |
Presto | ORGANIZATION | 0.93+ |
Gartner | ORGANIZATION | 0.93+ |
last decade | DATE | 0.92+ |
terabytes of data | QUANTITY | 0.92+ |
first log | QUANTITY | 0.91+ |
single click | QUANTITY | 0.9+ |
Presto | PERSON | 0.9+ |
theCUBE | ORGANIZATION | 0.88+ |
Steven Mih, Ahana and Sachin Nayyar, Securonix | AWS Startup Showcase
>> Voiceover: From theCUBE's Studios in Palo Alto in Boston, connecting with thought leaders all around the world, this is theCUBE Conversation. >> Welcome back to theCUBE's coverage of the AWS Startup Showcase. Next Big Thing in AI, Security and Life Sciences featuring Ahana for the AI Trek. I'm your host, John Furrier. Today, we're joined by two great guests, Steven Mih, Ahana CEO, and Sachin Nayyar, Securonix CEO. Gentlemen, thanks for coming on theCUBE. We're talking about the Next-Gen technologies on AI, Open Data Lakes, et cetera. Thanks for coming on. >> Thanks for having us, John. >> Thanks, John. >> What a great line up here. >> Sachin: Thanks, Steven. >> Great, great stuff. Sachin, let's get in and talk about your company, Securonix. What do you guys do? Take us through, I know you've got a slide to help us through this, I want to introduce your stuff first then jump in with Steven. >> Absolutely. Thanks again, Steven. Ahana team for having us on the show. So Securonix, we started the company in 2010. We are the leader in security analytics and response capability for the cybermarket. So basically, this is a category of solutions called SIEM, Security Incident and Event Management. We are the quadrant leaders in Gartner, we now have about 500 customers today and have been plugging away since 2010. Started the company just really focused on analytics using machine learning and an advanced analytics to really find the needle in the haystack, then moved from there to needle in the needle stack using more algorithms, analysis of analysis. And then kind of, I evolved the company to run on cloud and become sort of the biggest security data lake on cloud and provide all the analytics to help companies with their insider threat, cyber threat, cloud solutions, application threats, emerging internally and externally, and then response and have a great partnership with Ahana as well as with AWS. So looking forward to this session, thank you. >> Awesome. I can't wait to hear the news on that Next-Gen SIEM leadership. Steven, Ahana, talk about what's going on with you guys, give us the update, a lot of stuff happening. >> Yeah. Great to be here and thanks for that such, and we appreciate the partnership as well with both Securonix and AWS. Ahana is the open source company based on PrestoDB, which is a project that came out of Facebook and is widely used, one of the fastest growing projects in data analytics today. And we make a managed service for Presto easily on AWS, all cloud native. And we'll be talking about that more during the show. Really excited to be here. We believe in open source. We believe in all the challenges of having data in the cloud and making it easy to use. So thanks for having us again. >> And looking forward to digging into that managed service and why that's been so successful. Looking forward to that. Let's get into the Securonix Next-Gen SIEM leadership first. Let's share the journey towards what you guys are doing here. As the Open Data Lakes on AWS has been a hot topic, the success of data in the cloud, no doubt is on everyone's mind especially with the edge coming. It's just, I mean, just incredible growth. Take us through Sachin, what do you guys got going on? >> Absolutely. Thanks, John. We are hearing about cyber threats every day. No question about it. So in the past, what was happening is companies, what we have done as enterprise is put all of our eggs in the basket of solutions that were evaluating the network data. 
With cloud, obviously there is no more network data. Now we have moved into focusing on EDR, right thing to do on endpoint detection. But with that, we also need security analytics across on-premise and cloud. And your other solutions like your OT, IOT, your mobile, bringing it all together into a security data lake and then running purpose built analytics on top of that, and then having a response so we can prevent some of these things from happening or detect them in real time versus innovating for hours or weeks and months, which is is obviously too late. So with some of the recent events happening around colonial and others, we all know cybersecurity is on top of everybody's mind. First and foremost, I also want to. >> Steven: (indistinct) slide one and that's all based off on top of the data lake, right? >> Sachin: Yes, absolutely. Absolutely. So before we go into on Securonix, I also want to congratulate everything going on with the new cyber initiatives with our government and just really excited to see some of the things that the government is also doing in this space to bring, to have stronger regulation and bring together the government and the private sector. From a Securonix perspective, today, we have one third of the fortune 500 companies using our technology. In addition, there are hundreds of small and medium sized companies that rely on Securonix for their cyber protection. So what we do is, again, we are running the solution on cloud, and that is very important. It is not just important for hosting, but in the space of cybersecurity, you need to have a solution, which is not, so where we can update the threat models and we can use the intelligence or the Intel that we gather from our customers, partners, and industry experts and roll it out to our customers within seconds and minutes, because the game is real time in cybersecurity. And that you can only do in cloud where you have the complete telemetry and access to these environments. When we go on-premise traditionally, what you will see is customers are even thinking about pushing the threat models through their standard Dev test life cycle management, and which is just completely defeating the purpose. So in any event, Securonix on the cloud brings together all the data, then runs purpose-built analytics on it. Helps you find very few, we are today pulling in several million events per second from our customers, and we provide just a very small handful of events and reduce the false positives so that people can focus on them. Their security command center can focus on that and then configure response actions on top of that. So we can take action for known issues and have intelligence in all the layers. So that's kind of what the Securonix is focused on. >> Steven, he just brought up, probably the most important story in technology right now. That's ransomware more than, first of all, cybersecurity in general, but ransomware, he mentioned some of the government efforts. Some are saying that the ransomware marketplace is bigger than some governments, nation state governments. There's a business model behind it. It's highly active. It's dominating the scene and it's a real threat. This is the new world we're living in, cloud creates the refactoring capabilities. We're hearing that story here with Securonix. How does Presto and Securonix work together? Because I'm connecting the dots here in real time. I think you're going to go there. So take us through because this is like the most important topic happening. >> Yeah. 
So as Sachin said, there's all this data that needs to go into the cloud and it's all moving to the cloud. And there's a massive amounts of data and hundreds of terabytes, petabytes of data that's moving into the data lakes and that's the S3-based data lakes, which are the easiest, cheapest, commodified place to put all this data. But in order to deliver the results that Sachin's company is driving, which is intelligence on when there's a ransomware or possibility, you need to have analytics on them. And so Presto is the open source project that is a open source SQL query engine for data lakes and other data sources. It was created by Facebook as part of the Linux foundation, something called Presto foundation. And it was built to replace the complicated Hadoop stack in order to then drive analytics at very lightning fast queries on large, large sets of data. And so Presto fits in with this Open Data Lake analytics movement, which has made Presto one of the fastest growing projects out there. >> What is an Open Data Lake? Real quick for the audience who wants to learn on what it means. Does is it means it's open source in the Linux foundation or open meaning it's open to multiple applications? What does that even mean? >> Yeah. Open Data Lake analytics means that you're, first of all, your data lake has open formats. So it is made up of say something called the ORC or Parquet. And these are formats that any engine can be used against. That's really great, instead of having locked in data types. Data lakes can have all different types of data. It can have unstructured, semi-structured data. It's not just the structured data, which is typically in your data warehouses. There's a lot more data going into the Open Data Lake. And then you can, based on what workload you're looking to get benefit from, the insights come from that, and actually slide two covers this pictorially. If you look on the left here on slide two, the Open Data Lake is where all the data is pulling. And Presto is the layer in between that and the insights which are driven by the visualization, reporting, dashboarding, BI tools or applications like in Securonix case. And so analytics are now being driven by every company for not just industries of security, but it's also for every industry out there, retail, e-commerce, you name it. There's a healthcare, financials, all are looking at driving more analytics for their SaaSified applications as well as for their own internal analysts, data scientists, and folks that are trying to be more data-driven. >> All right. Let's talk about the relationship now with where Presto fits in with Securonix because I get the open data layer. I see value in that. I get also what we're talking about the cloud and being faster with the datasets. So how does, Sachin' Securonix and Ahana fit in together? >> Yeah. Great question. So I'll tell you, we have two customers. I'll give you an example. We have two fortune 10 customers. One has moved most of their operations to the cloud and another customer which is in the process, early stage. The data, the amount of data that we are getting from the customer who's moved fully to the cloud is 20 times, 20 times more than the customer who's in the early stages of moving to the cloud. That is because the ability to add this level of telemetry in the cloud, in this case, it happens to be AWS, Office 365, Salesforce and several other rescalers across several other cloud technologies. 
But the level of logging that we are able to get the telemetry is unbelievable. So what it does is it allows us to analyze more, protect the customers better, protect them in real time, but there is a cost and scale factor to that. So like I said, when you are trying to pull in billions of events per day from a customer billions of events per day, what the customers are looking for is all of that data goes in, all of data gets enriched so that it makes sense to a normal analyst and all of that data is available for search, sometimes 90 days, sometimes 12 months. And then all of that data is available to be brought back into a searchable format for up to seven years. So think about the amount of data we are dealing with here and we have to provide a solution for this problem at a price that is affordable to the customer and that a medium-sized company as well as a large organization can afford. So after a lot of our analysis on this and again, Securonix is focused on cyber, bringing in the data, analyzing it, so after a lot of our analysis, we zeroed in on S3 as the core bucket where this data needs to be stored because the price point, the reliability, and all the other functions available on top of that. And with that, with S3, we've created a great partnership with AWS as well as with Snowflake that is providing this, from a data lake perspective, a bigger data lake, enterprise data lake perspective. So now for us to be able to provide customers the ability to search that data. So data comes in, we are enriching it. We are putting it in S3 in real time. Now, this is where Presto comes in. In our research, Presto came out as the best search engine to sit on top of S3. The engine is supported by companies like Facebook and Uber, and it is open source. So open source, like you asked the question. So for companies like us, we cannot depend on a very small technology company to offer mission critical capabilities because what if that company gets acquired, et cetera. In the case of open source, we are able to adopt it. We know there is a community behind it and it will be kind of available for us to use and we will be able to contribute in it for the longterm. Number two, from an open source perspective, we have a strong belief that customers own their own data. Traditionally, like Steven used the word locked in, it's a key term, customers have been locked in into proprietary formats in the past and those days are over. You should be, you own the data and you should be able to use it with us and with other systems of choice. So now you get into a data search engine like Presto, which scales independently of the storage. And then when we start looking at Presto, we came across Ahana. So for every open source system, you definitely need a sort of a for-profit company that invests in the community and then that takes the community forward. Because without a company like this, the community will die. So we are very excited about the partnership with Presto and Ahana. And Ahana provides us the ability to take Presto and cloudify it, or make the cloud operations work plus be our conduit to the Ahana community. Help us speed up certain items on the roadmap, help our team contribute to the community as well. And then you have to take a solution like Presto, you have to put it in the cloud, you have to make it scale, you have to put it on Kubernetes. Standard thing that you need to do in today's world to offer it as sort of a micro service into our architecture. 
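One way to picture the retention tiers Sachin describes, roughly a year of directly searchable data and several more years that can be brought back on demand, is as S3 lifecycle rules plus a restore request. The sketch below is illustrative only; the bucket, prefix, and day counts are placeholders rather than Securonix's actual configuration.

```python
# A rough sketch of the retention tiers described above, expressed as S3
# lifecycle rules: keep roughly a year of enriched events directly queryable,
# move older objects to a colder tier, and pull archived data back on demand
# when an investigation needs history. All names and numbers are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-security-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-events",
                "Filter": {"Prefix": "enriched/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
                ],
                "Expiration": {"Days": 2555},  # roughly seven years
            }
        ]
    },
)

# Later, bring an archived object back into a readable state for a few days.
s3.restore_object(
    Bucket="my-security-lake",
    Key="enriched/dt=2019-03-01/part-0000.parquet",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
)
```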
So in all of those areas, that's where our partnership is with Ahana and Presto and S3 and we think, this is the search solution for the future. And with something like this, very soon, we will be able to offer our customers 12 months of data, searchable at extremely fast speeds at very reasonable price points and you will own your own data. So it has very significant business benefits for our customers with the technology partnership that we have set up here. So very excited about this. >> Sachin, it's very inspiring, a couple things there. One, decentralize on your own data, having a democratized, that piece is killer. Open source, great point. >> Absolutely. >> Company goes out of business, you don't want to lose the source code or get acquired or whatever. That's a key enabler. And then three, a fast managed service that has a commercial backing behind it. So, a great, and by the way, Snowflake wasn't around a couple of years ago. So like, so this is what we're talking about. This is the cloud scale. Steven, take us home with this point because this is what innovation looks like. Could you share why it's working? What's some of the things that people could walk away with and learn from as the new architecture for the new NextGen cloud is here, so this is a big part of and share how this works? >> That's right. As you heard from Sachin, every company is becoming data-driven and analytics are central to their business. There's more data and it needs to be analyzed at lower cost without the locked in and people want that flexibility. And so a slide three talks about what Ahana cloud for Presto does. It's the best Presto out of the box. It gives you very easy to use for your operations team. So it can be one or two people just managing this and they can get up to speed very quickly in 30 minutes, be up and running. And that jump starts their movement into an Open Data Lake analytics architecture. That architecture is going to be, it is the one that is at Facebook, Uber, Twitter, other large web scale, internet scale companies. And with the amount of data that's occurring, that's now becoming the standard architecture for everyone else in the future. And so just to wrap, we're really excited about making that easy, giving an open source solution because the open source data stack based off of data lake analytics is really happening. >> I got to ask you, you've seen many waves on the industry. Certainly, you've been through the big data waves, Steven. Sachin, you're on the cutting edge and just the cutting edge billions of signals from one client alone is pretty amazing scale and refactoring that value proposition is super important. What's different from 10 years ago when the Hadoop, you mentioned Hadoop earlier, which is RIP, obviously the cloud killed it. We all know that. Everyone kind of knows that. But like, what's different now? I mean, skeptics might say, I don't believe you, but it's just crazy. There's no way it works. S3 costs way too much. Why is this now so much more of an attractive proposition? What do you say the naysayers out there? With Steve, we'll start with you and then Sachin, I want you to like weigh in too. >> Yeah. Well, if you think about the Hadoop era and if you look at slide three, it was a very complicated system that was done mainly on-prem. 
And you'd have to go and set up a big data team and a rack and stack a bunch of servers and then try to put all this stuff together and candidly, the results and the outcomes of that were very hard to get unless you had the best possible teams and invested a lot of money in this. What you saw in this slide was that, that right hand side which shows the stack. Now you have a separate compute, which is based off of Intel based instances in the cloud. We run the best in that and they're part of the Presto foundation. And that's now data lakes. Now the distributed compute engines are the ones that have become very much easier. So the big difference in what I see is no longer called big data. It's just called data analytics because it's now become commodified as being easy and the bar is much, much lower, so everyone can get the benefit of this across industries, across organizations. I mean, that's good for the world, reduces the security threats, the ransomware, in the case of Securonix and Sachin here. But every company can benefit from this. >> Sachin, this is really as an example in my mind and you can comment too on if you'd believe or not, but replatform with the cloud, that's a no brainer. People do that. They did it. But the value is refactoring in the cloud. It's thinking differently with the assets you have and making sure you're using the right pieces. I mean, there's no brainer, you know it's good. If it costs more money to stand up something than to like get value out of something that's operating at scale, much easier equation. What's your thoughts on this? Go back 10 years and where we are now, what's different? I mean, replatforming, refactoring, all kinds of happening. What's your take on all this? >> Agreed, John. So we have been in business now for about 10 to 11 years. And when we started my hair was all black. Okay. >> John: You're so silly. >> Okay. So this, everything has happened here is the transition from Hadoop to cloud. Okay. This is what the result has been. So people can see it for themselves. So when we started off with deep partnerships with the Hadoop providers and again, Hadoop is the foundation, which has now become EMR and everything else that AWS and other companies have picked up. But when you start with some basic premise, first, the racking and stacking of hardware, companies having to project their entire data volume upfront, bringing the servers and have 50, 100, 500 servers sitting in their data centers. And then when there are spikes in data, or like I said, as you move to the cloud, your data volume will increase between five to 20x and projecting for that. And then think about the agility that it will take you three to six months to bring in new servers and then bring them into the architecture. So big issue. Number two big issue is that the backend of that was built for HDFS. So Hadoop in my mind was built to ingest large amounts of data in batches and then perform some spark jobs on it, some analytics. But we are talking in security about real time, high velocity, high variety data, which has to be available in real time. It wasn't built for that, to be honest. 
So what was happening is, again, even if you look at the Hadoop companies today as they have kind of figured, kind of define their next generation, they have moved from HDFS to now kind of a cloud based platform capability and have discarded the traditional HDFS architecture because it just wasn't scaling, wasn't searching fast enough, wasn't searching fast enough for hundreds of analysts at the same time. And then obviously, the servers, et cetera wasn't working. Then when we worked with the Hadoop companies, they were always two to three versions behind for the individual services that they had brought together. And again, when you're talking about this kind of a volume, you need to be on the cutting edge always of the technologies underneath that. So even while we were working with them, we had to support our own versions of Kafka, Solr, Zookeeper, et cetera to really bring it together and provide our customers this capability. So now when we have moved to the cloud with solutions like EMR behind us, AWS has invested in in solutions like EMR to make them scalable, to have scale and then scale out, which traditional Hadoop did not provide because they missed the cloud wave. And then on top of that, again, rather than throwing data in that traditional older HDFS format, we are now taking the same format, the parquet format that it supports, putting it in S3 and now making it available and using all the capabilities like you said, the refactoring of that is critical. That rather than on-prem having servers and redundancies with S3, we get built in redundancy. We get built in life cycle management, high degree of confidence data reliability. And then we get all this innovation from companies like, from groups like Presto, companies like Ahana sitting on double that S3. And the last item I would say is in the cloud we are now able to offer multiple, have multiple resilient options on our side. So for example, with us, we still have some premium searching going on with solutions like Solr and Elasticsearch, then you have Presto and Ahana providing majority of our searching, but we still have Athena as a backup in case something goes down in the architecture. Our queries will spin back up to Athena, AWS service on Presto and customers will still get served. So all of these options, but what it doesn't cost us anything, Athena, if we don't use it, but all of these options are not available on-prem. So in my mind, I mean, it's a whole new world we are living in. It is a world where now we have made it possible for companies to even enterprises to even think about having true security data lakes, which are useful and having real-time analytics. From my perspective, I don't even sign up today for a large enterprise that wants to build a data lake on-prem because I know that is not, that is going to be a very difficult project to make it successful. So we've come a long way and there are several details around this that we've kind of endured through the process, but very excited where we are today. >> Well, we certainly follow up with theCUBE on all your your endeavors. Quickly on Ahana, why them, why their solution? In your words, what would be the advice you'd give me if I'm like, okay, I'm looking at this, why do I want to use it, and what's your experience? >> Right. So the standard SQL query engine for data lake analytics, more and more people have more data, want to have something that's based on open source, based on open formats, gives you that flexibility, pay as you go. 
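The Athena fallback Sachin mentions above can be sketched as re-submitting the same SQL through the Athena API, since Athena also reads Parquet directly from S3. The database, table, and output location below are hypothetical placeholders, not the actual Securonix configuration.

```python
# Illustrative fallback path: if the primary Presto cluster is unreachable,
# submit the query to Athena instead, which also reads Parquet from S3.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="""
        SELECT user_name, count(*) AS events
        FROM login_events
        WHERE dt >= '2021-06-01'
        GROUP BY user_name
        ORDER BY events DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "security"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Athena query submitted:", resp["QueryExecutionId"])
```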
You only pay for what you use. And so it proved to be the best option for Securonix to create a self-service system that has all the speed and performance and scalability that they need, which is based off of the innovation from the large companies like Facebook, Uber, Twitter. They've all invested heavily. We contribute to the open source project. It's a vibrant community. We encourage people to join the community and even Securonix, we'll be having engineers that are contributing to the project as well. I think, is that right Sachin? Maybe you could share a little bit about your thoughts on being part of the community. >> Yeah. So also why we chose Ahana, like John said. The first reason is you see Steven is always smiling. Okay. >> That's for sure. >> That is very important. I mean, jokes apart, you need a great partner. You need a great partner. You need a partner with a great attitude because this is not a sprint, this is a marathon. So the Ahana founders, Steven, the whole team, they're world-class, they're world-class. The depth that the CTO has, his experience, the depth that Dipti has, who's running the cloud solution. These guys are world-class. They are very involved in the community. We evaluated them from a community perspective. They are very involved. They have the depth of really commercializing an open source solution without making it too commercial. The right balance, where the founding companies like Facebook and Uber, and hopefully Securonix in the future as we contribute more and more will have our say and they act like the right stewards in this journey and then contribute as well. So and then they have chosen the right niche rather than taking portions of the product and making it proprietary. They have put in the effort towards the cloud infrastructure of making that product available easily on the cloud. So I think it's sort of a no-brainer from our side. Once we chose Presto, Ahana was the no-brainer and just the partnership so far has been very exciting and I'm looking forward to great things together. >> Likewise Sachin, thanks so much for that. And we've only found your team, you're world-class as well, and working together and we look forward to working in the community also in the Presto foundation. So thanks for that. >> Guys, great partnership. Great insight and really, this is a great example of cloud scale, cloud value proposition as it unlocks new benefits. Open source, managed services, refactoring the opportunities to create more value. Stephen, Sachin, thank you so much for sharing your story here on open data lakes. Can open always wins in my mind. This is theCUBE we're always open and we're showcasing all the hot startups coming out of the AWS ecosystem for the AWS Startup Showcase. I'm John Furrier, your host. Thanks for watching. (bright music)
SUMMARY :
leaders all around the world, of the AWS Startup Showcase. to help us through this, and provide all the what's going on with you guys, in the cloud and making it easy to use. Let's get into the Securonix So in the past, what was So in any event, Securonix on the cloud Some are saying that the and that's the S3-based data in the Linux foundation or open meaning And Presto is the layer in because I get the open data layer. and all the other functions that piece is killer. and learn from as the new architecture for everyone else in the future. obviously the cloud killed it. and the bar is much, much lower, But the value is refactoring in the cloud. So we have been in business and again, Hadoop is the foundation, be the advice you'd give me system that has all the speed The first reason is you see and just the partnership so in the community also in for the AWS Startup Showcase.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Steven | PERSON | 0.99+ |
Sachin | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Steve | PERSON | 0.99+ |
Securonix | ORGANIZATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
Steven Mih | PERSON | 0.99+ |
50 | QUANTITY | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
2010 | DATE | 0.99+ |
Stephen | PERSON | 0.99+ |
Sachin Nayyar | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
20 times | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
12 months | QUANTITY | 0.99+ |
three | QUANTITY | 0.99+ |
ORGANIZATION | 0.99+ | |
Ahana | PERSON | 0.99+ |
two customers | QUANTITY | 0.99+ |
90 days | QUANTITY | 0.99+ |
Ahana | ORGANIZATION | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
100 | QUANTITY | 0.99+ |
30 minutes | QUANTITY | 0.99+ |
Presto | ORGANIZATION | 0.99+ |
hundreds of terabytes | QUANTITY | 0.99+ |
five | QUANTITY | 0.99+ |
First | QUANTITY | 0.99+ |
One | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
hundreds | QUANTITY | 0.99+ |
six months | QUANTITY | 0.99+ |
S3 | TITLE | 0.99+ |
Zookeeper | TITLE | 0.99+ |
Mai Lan Tomsen Bukovec, AWS | theCUBE on Cloud 2021
>> From around the globe, it's theCUBE, presenting CUBE on Cloud, brought to you by SiliconANGLE.
But as you say, use it well. >> You know, we talk about data growth a lot, and sometimes it becomes a bromide. But I think the interesting thing that I've observed over the last couple of decades is that the growth is nonlinear. The curve is starting to shape exponentially. You guys always talk about that flywheel effect. It's really hard to believe. You know, people say trees don't grow to the moon; it seems like data does. >> It does. And what's interesting about working in the world of AWS storage, Dave, is that it's counterintuitive, but our goal with that data growth is to make it cost effective. And so year over year, how can we make it cheaper and cheaper to have customers store more and more data so they can use it? But it's also to think about the definition of usage, and what kind of data is being tapped by businesses for their insights, and make that easier than it's ever been before. >> Let me ask you a follow-up question on that, Mai Lan, because I get asked this a lot, or I hear comments a lot, that yes, AWS continuously and rigorously reduces pricing, but it's just kind of following the natural curve of Moore's law, or, you know, whatever. How do you respond to that? And there are other factors involved. Obviously, labor is another cost-reducing factor. But what does the trend line say? >> Well, cost efficiency is in our DNA, Dave. We come to work every day at AWS, across all of our services, and we ask ourselves, how can we lower our costs and be able to pass that along to customers? As you say, there are many different aspects to cost. There's the cost of the storage itself, there's the cost of the data center. And that's really what we've seen impact a lot of customers that were slower or just getting started with a move to the cloud: they entered 2020 and then they found out exactly how expensive that data center was to maintain, because they had to put in safety equipment and do all the things that you have to do in a pandemic in a data center. And so sometimes that cost is a little bit hidden, or won't show up until you really don't need to have it land on you. But the cost of managing that explosive growth of data is very real. And when we're thinking about cost, we're thinking about cost in terms of how can I lower it on a per-gigabyte-per-month basis, but we're also building adaptive discounts into the product itself. For example, we have a storage class in S3 called Intelligent-Tiering. In Intelligent-Tiering, we have built-in monitoring where, if particular objects aren't frequently accessed in a given month, a customer will automatically get a discounted price for that storage. Or a customer can, as of late last year, say that they want to automatically move storage that has been stored, for example, longer than 180 days, and save 95% by moving it into archive storage, deep archive storage. And so it's not just, you know, relentlessly going after and lowering the cost of storage. It's also building into the products these new ways where we can adaptively discount storage based on what a customer's storage is actually doing.
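For reference, a minimal sketch of the 180-day archive policy described above, expressed as an S3 lifecycle rule with boto3; the bucket name, prefix, and rule ID are hypothetical, and Intelligent-Tiering's own archive configuration is an alternative route to the same outcome.

```python
# Minimal sketch: move objects older than 180 days to Deep Archive
# using an S3 lifecycle rule. Bucket name, prefix, and rule ID are
# hypothetical; adjust to your own environment.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",           # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-after-180-days",
                "Filter": {"Prefix": "logs/"},   # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [
                    # After 180 days, objects move to S3 Glacier Deep Archive,
                    # which is where the ~95% storage-cost saving comes from.
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    },
)
```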
>> Well, and I would add for our audience, the other thing AWS has done is it's really forced transparency, almost the same way that Amazon has done in retail. Now, Mai Lan, when we talked last, I mentioned that S3 was an object store, and of course that's technically correct, but your comment to me was, "Dave, it's more than that." And you started to talk about SageMaker and AI and bringing in machine learning. And I wonder if you could talk a little bit about the future of how storage is going to be leveraged in the cloud, which may be different than what we've been used to in the early days of S3, and how your customers should be thinking about infrastructure not as bespoke services but as a suite of capabilities, and maybe some of those adjacent services that you see as most leverageable for customers, and why. >> Well, to tell this story, Dave, we're going to have to go a little bit back in time, all the way back to the 1990s or before then, when all you had was a set of hardware appliance vendors that sold you appliances that you put in your data center, and that inherently created a data silo, because those hardware appliances were hardwired to your application. And so an individual application that was dealing with auditing, as an example, wouldn't really be able to access the storage for another application, because, you know, the architecture of that legacy world is tied to a data silo. S3 launched in 2006 and introduced very low-cost storage that is an object. And I'll tell you, Dave, over the last 10-plus years we have seen all kinds of data come into S3. Whereas before it might have been backups, or it might have been images and videos, now a pretty substantial data set is Parquet files and ORC files. These files are there for business analytics, for more real-time types of processing. And that has really been the trend of the future: taking these different files and putting them in a shared file layer, so any application today or in the future can tap into that data. And so this idea of the shared file layer is a major trend that has been taking off for the last, I would say, five or six years, and I expect that to not only keep going but to really open up the type of services that you can then do on that shared file layer, whether that's SageMaker or some of the machine learning introduced by our Connect service. It's bringing together the data as a starting point, and then the applications can evolve very rapidly on top of that. >> I want to ask your opinion about big data architectures. One of our guests, Zhamak Dehghani, she's an amazing data architect, and she's put forth this notion of a distributed global data mesh. And picking up on some of the comments Andy Jassy made at re:Invent, essentially, hey, we're bringing AWS to the edge, we see the data center as just another edge node, you're seeing this massive distributed system evolving. You guys have talked about that for a while, and data by its very nature is distributed, but we've had this tendency to put it into a monolithic data lake or a data warehouse, and that's sort of antithetical to that distributed nature. So how do you see that playing out? What do you see customers in the future doing in terms of their big data architectures, and what does that mean for storage? >> It comes down to the nature of the data and, again, the usage, Dave. Where I see the biggest difference in these modern data architectures from the legacy of 20 years ago is the idea that the data need drives the data storage. So let's take an example of the type of data that you always want to have on the edge.
We have customers today that need to have storage in the field, whether the field is scientific research, or oftentimes it's content creation in the film industry, or it's for military operations. There's a lot of data that needs to be captured and analyzed in the field, and for us, what that means is we have a suite of products called Snowball, and whether it's Snowball or Snowcone, take your pick, that whole portfolio of AWS services is targeted at customers that need to do work with storage at the edge. And if you think about the need for multiple applications acting on the same data set, that's when you keep it in an AWS region. What we've done in AWS storage is we've recognized that depending on the need of usage, where you put your data and how you interact with it may vary. But we've built a whole set of services, like data transfer, to help make sure that we can connect data from, for example, that new Snowcone into a region automatically. And so our goal, Dave, is to make sure that when customers are operating at the edge or they're operating in the region, they have the same quality of storage service, and they have easy ways to go between them. You shouldn't have to pick; you should be able to do it all. >> So in the spirit of doing it all, there's this sort of age-old dynamic in the tech business where you've got the friction between best of breed and the integrated suite, and my question is around what you're optimizing for, for customers. And can you have your cake and eat it too? In other words, what makes AWS storage compelling? Is it because it's kind of a best-of-breed storage service, or is it because it's integrated with AWS? Would you ever sub-optimize one in order to get an advantage for the other? Or can you actually, you know, have your cake and eat it too? >> The way that we build storage is to focus on both the breadth of capabilities and the depth of capabilities. And so where we identify a particular need where we think that it takes a whole new service to deliver, we'll go build that service. An example of that is our AWS SFTP service. You know, there's a lot of SFTP usage out there, and there will be for a while, because of the legacy B2B types of architectures that still live in the business world today. And so we looked at that problem, we said, how are we going to build that in the best, deepest, most focused way, and we launched a separate service for it. And so our goal is to take the individual building blocks of EBS and Glacier and S3 and make each the best of class and the most comprehensive in its capabilities, and where we identify a very specific need, we'll go build a service for it. But Dave, you know, as an example of that idea of both depth and breadth, S3 Storage Lens is a great example. S3 Storage Lens is a new capability that we launched last year, and what it does is it lets you look across all your regions and all your accounts and get a summary view of all your S3 storage, whether that's buckets or, you know, the most active prefixes that you have, and be able to drill down from that. And that is built into the S3 service and available for any customer that wants to turn it on in the AWS Management Console. >> Right, and we saw just recently, I called it super-duper block storage, but you made some, you know, improvements really addressing the highest performance.
I want to ask you, so we've all learned about and experienced the benefits of cloud over the last several years, and especially in the last 10 months during the pandemic. But one of the challenges, and it's particularly acute with I/O, is of course latency, and moving data around and accessing data remotely. It's a challenge for customers, you know, due to the speed of light, et cetera. So my question is, how is AWS thinking about all that data that still resides on premises? I think we heard at re:Invent that's still 90% of the opportunity, or of the workloads; they still live on prem inside a customer's data center. So how do you tap into those and help customers innovate with on-prem data, particularly from a storage angle? >> Well, we always want to provide the best-of-class solution for those low-latency workloads, and that's why we launched Block Express just late last year at re:Invent. Block Express is a new capability, in preview, on top of our io2 provisioned IOPS volume type. And what's really interesting about Block Express, Dave, is that the way we're able to deliver the performance of Block Express, which is SAN performance with cloud elasticity, is that we went all the way down to the network layer and we customized the hardware and software. At the network layer, we built Block Express on something called SRD, which stands for Scalable Reliable Datagram. And basically what it lets us do is offload all of our EBS operations for Block Express onto the Nitro card, onto hardware. And so that type of innovation, where we're able to, you know, take advantage of modern commodity, multi-tenant data center networks, where we're sending this new network protocol across a large number of network paths, that type of innovation all the way down to the protocol level helps us innovate in a way that's hard, in fact I would say impossible, for other SAN providers to really catch up with and keep up with. And so we feel that the amount of innovation that we have for delivering those low-latency workloads in our AWS cloud storage is unlimited, really, because of that ability to customize software, hardware, and network protocols as we go along. Without requiring upgrades from a customer, it just gets better, and the customer benefits. Now, if you want to stay in your data center, that's why we built Outposts. And for Outposts, we have EBS and we have S3 for Outposts. And our goal there is that some customers will have workloads where they want to keep them resident in the data center, and for those customers, we want to give them those AWS storage opportunities as well. >> So thank you for coming back to Block Express. You call it a SAN in the cloud, so is that essentially, you've composed a custom-built, essentially, storage network? Is that right, what you just described, SRD, I think you call it? >> Yeah, SRD is used by other AWS services as well, but it is a custom network protocol that we designed to deliver the lowest-latency experience, and we're taking advantage of it with Block Express. >> Sticking with traditional data centers for a moment, I'm interested in your thoughts on the importance of the cloud pricing approach, i.e. the consumption model, pay by the drink. Obviously it's one of the most attractive features. And I ask that because we're seeing what Andy Jassy first called the old guard institute flexible pricing models.
Two of the biggest storage companies, HPE with GreenLake and Dell with this thing called Apex, have announced such models for on-prem and, presumably, cross-cloud. How do you think this is going to impact your customers' leverage of AWS cloud storage? Is it something that you have an opinion on? >> Yeah, I think it all comes down, again, to that usage of the storage, and this is where I think there is an inherent advantage for our cloud storage. So there might be an attempt by the old guard to lower prices or add flexibility, but at the end of the day it comes down to what the customer actually needs to do. And if you think about gp3, which is the new EBS volume, the idea with gp3 is we're going to pass along savings to the customer by making the storage 20% cheaper than gp2, and we're going to make the product better by giving a great, reliable baseline performance. But we're also going to let customers who want to run workloads like Cassandra on EBS tune their throughput separately, for example, from their capacity. So if you're running Cassandra, sometimes you don't need to change your capacity; your storage capacity works just fine. But what happens with, for example, a Cassandra workload is that you may need more throughput. And if you're buying a hardware appliance, you just have to buy for your peak. You have to buy for the max of what you think your throughput will be and the max of what your storage will be. And this inherent flexibility that we have for AWS storage, being able to tune throughput separately from IOPS, separately from capacity, like you do for gp3, that is really where the future is: customers having control over costs and control over customer experience without compromising or trading off either one. >> Awesome, thank you for that. So in the time we have remaining, Mai Lan, I want to talk about the topic of diversity and social impact. And as a woman leader, a woman executive, I really want to get your perspectives on this, and I've shared with the audience previously, in one of my Breaking Analysis segments, your boxing video, which is awesome. So you've got a lot of unique, non-traditional aspects to your life, and I love it. But I want to ask you this. It's obviously, you know, certainly politically and socially correct to talk about diversity, the importance of diversity. There's data that suggests that diversity is good, not just socially but economically, and of course it's the right thing to do. But there are those, Peter Thiel is probably the most prominent, but there are others, who say, you know what, forget that, just hire people just like you and you'll be able to go faster, ramp up more quickly, hit escape velocity. It's natural, and that's what you should do. Why is that not the right approach? Why is diversity, of course, socially responsible, but also good for business? >> For Amazon, we think about diversity as something that is essential to how we think about innovation. And so, Dave, as you know from listening to some of the announcements at re:Invent, we launched a lot of new ideas, new concepts, and new services in AWS. And just bringing that lens down to storage: S3 has been reinventing itself every year since we launched in 2006. EBS introduced the first SAN in the cloud late last year and continues to reinvent how customers think about block storage. We would not be able to
look at a product in a different way and think to ourselves, not just what does the legacy system do in a data center today, but how do we want to build this new distributed system in a way that helps customers achieve not just what they're doing today, but what they want to do in five and ten years. You can't get that innovative mindset without bringing different perspectives to the table. And so we strongly believe in hiring people who are from underrepresented groups, whether that's gender, or it's related to racial equality, or it's geographic diversity, and bringing them in to have the conversation, because those diverse viewpoints inform how we can innovate at all levels in AWS. >> Right, and so I really appreciate the perspectives on that. And as you probably know, theCUBE has been, you know, a very big advocate of diversity generally, but women in tech specifically; we've participated a lot. And you know, I often ask this question: as a smaller company, I and some of my other colleagues in small business sometimes struggle. And so my question is, how do you go beyond, what's your advice for going beyond, you know, the good-old-boys network? I think at large companies like AWS and the big players, you've got a responsibility to that, and you can put somebody in charge and make it, you know, their full-time job. How should smaller companies, that are largely white-male dominated, how should they become more diverse? What should they do to increase that diversity? >> Well, I think the place to start is voice. A lot of what we try to do is make sure that the underrepresented voice is heard. And so, Dave, any small business owner in any industry can encourage voice for your underrepresented or your unheard populations. And honestly, it is as simple as being in a meeting and looking around that table, or on your screen as it were, and asking yourself, who hasn't talked? Who hasn't weighed in, particularly if the debate is contentious or even animated? And you will see, particularly if you note this over time, that there may be somebody, whether it's a member of an underrepresented group, or it's a woman who's early in her career, or it's not, it's just a member of your team who happens to be a white male too, who's not being heard. And you can ask that person for their perspective. And that is a step that every one of us can and should do, which is ask to have everyone's voice at the table, to listen, and to weigh in on it. So I think that is something everyone should do. I think if you are a member of an underrepresented group, as, for example, I'm Vietnamese American and I'm a female in tech, I think it's something to think about how you can make sure that you're always taking that bold step forward. And it's one of the topics that we covered at re:Invent. We had a great discussion with a group of women CEOs, and a lot of what we talked about is being bold, taking the challenge of being bold in tough situations. And that is an important thing, I think, for anybody to keep in mind, but especially for members of underrepresented groups, because sometimes, Dave, that bold step that you kind of think of, like, oh, I don't know if I should ask for that promotion, or I don't know if I should volunteer for that project, it's not a big ask, but it's big in your head.
And so if you can internalize, as a member of, you know, a group that maybe hasn't been heard or seen as much, how you can take those bold challenges and step forward and learn, maybe fail also, because that's how you learn, then that is a way to also have people learn and develop and become leaders in whatever industry it is. >> It's great advice, and it reminds me, I think most of us can relate to that, Mai Lan, because when we started in the industry, we may have been timid; you didn't want to necessarily speak up. And I think it's incumbent upon those in a position of power, and by the way, power might just be running a meeting agenda, to maybe call on those folks. And maybe it's not diversity of gender or race, maybe it's just the underrepresented. Maybe that's a good way to start building muscle memory. So that's unique advice that I hadn't heard before. Thank you very much for that, appreciate it. And hey, listen, thanks so much for coming on theCUBE on Cloud. We're out of time, and really, I always appreciate your perspectives. You're doing a great job, and thank you. >> Great, thank you, Dave. Thanks for having me, and have a great day. >> All right, and keep it right there, everybody. You're watching theCUBE on Cloud. Right back.
SUMMARY :
cloud brought to you by silicon angle. Great to see you again. Nice to be here. capabilities that you announced recently. So my first question is, how should we think about this expanding portfolio? and if you could bring that data into a data lake, you can have not just analytics or What are the other big trends that you see if any? And it's not going to stop. that I've observed over the last a couple of decades really is that the growth is nonlinear And so year over year, how could we make it cheaper and cheaper? you a follow up question on that my life could I get asked this a lot? following the natural curve of Moore's law or, you know, And there are other factors involved. And so it's not just, you know, relentlessly going after And I wonder if you could talk a little bit about the future of how storage is gonna be leveraged in the cloud that's that you put in your data center and inherently created a data silo because those hardware We see the data center is just another And so it you know, if you think about the need And can you have your cake and eat it too? And what it does is it lets you look across all your regions and all your you know, improvements and really addressing the highest performance. It's It's a challenge for customers, you know, And at the network Lehrer, we built a Block Express on something called SRD, What kind of what you just described? Two of the biggest storage companies HP with Green Lake and Dell has this thing called Apex. But the end of the day it comes down to what the customer actually Thank you for that. And of course, it's the right thing to do. And that's what you should dio. Dave, you know, as you know, from listening to some of the announcements I reinvent, we launched a lot You probably know the Cube has been, you know, a very big advocate of diversity, You can put somebody in charge and make it you know, their full time job. And so if you can internalize as a member And maybe it's just the underrepresented. And keep it right, everybody.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave | PERSON | 0.99+ |
David | PERSON | 0.99+ |
Andy Jassy | PERSON | 0.99+ |
Dell | ORGANIZATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
PBS | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
HP | ORGANIZATION | 0.99+ |
90% | QUANTITY | 0.99+ |
Two | QUANTITY | 0.99+ |
40% | QUANTITY | 0.99+ |
Peter Thiel | PERSON | 0.99+ |
five | QUANTITY | 0.99+ |
20% | QUANTITY | 0.99+ |
six years | QUANTITY | 0.99+ |
2020 | DATE | 0.99+ |
2000 | DATE | 0.99+ |
last year | DATE | 0.99+ |
first question | QUANTITY | 0.99+ |
Green Lake | ORGANIZATION | 0.99+ |
95% | QUANTITY | 0.99+ |
three | QUANTITY | 0.99+ |
80 days | QUANTITY | 0.99+ |
CBS | ORGANIZATION | 0.99+ |
10 years | QUANTITY | 0.99+ |
Apex | ORGANIZATION | 0.99+ |
both | QUANTITY | 0.99+ |
TBS | ORGANIZATION | 0.98+ |
Moore | PERSON | 0.98+ |
Mai Lan Tomsen Bukovec | PERSON | 0.98+ |
one | QUANTITY | 0.98+ |
Guard Institute | ORGANIZATION | 0.98+ |
19 nineties | DATE | 0.98+ |
20 years ago | DATE | 0.97+ |
late last year | DATE | 0.97+ |
longer than 100 | QUANTITY | 0.96+ |
late last year | DATE | 0.95+ |
One | QUANTITY | 0.95+ |
today | DATE | 0.95+ |
Cuban | OTHER | 0.94+ |
Milan Thompson Bukovec | PERSON | 0.94+ |
late last year | DATE | 0.94+ |
pandemic | EVENT | 0.94+ |
AWS Management Council | ORGANIZATION | 0.93+ |
a couple years later | DATE | 0.91+ |
Leighton | ORGANIZATION | 0.91+ |
last 10 months | DATE | 0.91+ |
EBS | ORGANIZATION | 0.9+ |
Jim Octagon E. | PERSON | 0.89+ |
first | QUANTITY | 0.89+ |
gp three | TITLE | 0.87+ |
Block Express | COMMERCIAL_ITEM | 0.87+ |
S. Tree | LOCATION | 0.86+ |
Cloud 2021 | TITLE | 0.85+ |
Ed Walsh, ChaosSearch | AWS re:Invent 2020 Partner Network Day
>> Narrator: From around the globe, it's theCUBE, with digital coverage of AWS re:Invent 2020. Special coverage sponsored by AWS Global Partner Network. >> Hello and welcome to theCUBE Virtual and our coverage of AWS re:Invent 2020, with special coverage of the APN partner experience. We are theCUBE Virtual and I'm your host, Justin Warren. And today I'm joined by Ed Walsh, CEO of ChaosSearch. Ed, welcome to theCUBE. >> Well, thank you for having me, I really appreciate it. >> Now, this is not your first time here on theCUBE. You're a regular here, and I've loved having you back. >> I love the platform, you guys are great. >> So let's start off by just reminding people about what ChaosSearch is and what you do there. >> Sure, the best way to say it is that ChaosSearch helps our clients know better. We don't do that with a special wizard or a widget that you give to your, you know, SecOps teams. What we do is the hard work to give you a data platform to get insights at scale. And we do that also by achieving the promise of data lakes. So what we have is the Chaos data platform. It connects and indexes data in a customer's S3 or Glacier accounts, so inside your data lake, not our data lake, and renders that data fully searchable and available for analysis using your existing tools today, because what we do is index it and publish open APIs, like the Elasticsearch API, and soon SQL. So to give you an example, based upon those capabilities, we're an ideal replacement for commonly deployed Elasticsearch or ELK Stack deployments if you're hitting scale issues. So we talk about scalable log analytics, and more and more people are hitting these scale issues. Let's say you're using Elasticsearch ELK or Amazon Elasticsearch and you're hitting scale issues. What I mean by that is you can't keep enough retention, you want longer retention, or it's getting very expensive to keep that retention, or, because of the scale you've hit, you have availability issues, where the cluster is hard to keep up and running or is crashing. That's what we mean by the issues at scale. And what we do, simply, is we allow you, because we're publishing the open API of Elasticsearch, to use all your tools, but we save you about 80% off your monthly bill. We also give you, and it's an "and" statement, unlimited retention, as much as you want to keep on S3 or into Glacier. But we also take care of all the hassles and management and the time to manage these clusters, which end up sitting on a database engine called Lucene. We take care of that as a managed service. And probably the biggest thing is, all of this without changing anything your end users are using. So we include Kibana, but imagine it's an Elastic API. So if you're using the API or Kibana, it's just easy to use the exact same tools you use today, but you get the benefits of a true data lake. In fact, we're now running Elasticsearch on top of S3 natively, if that makes sense. >> Right, and natively is pretty cool. And look, 80% savings is a dramatic number, particularly this year. I think there's a lot of people who are looking to save a few quid, so it'd be very nice to be able to save up to 80%. I am curious as to how you're able to achieve that kind of saving, though. >> Yeah, you won't be the first person to ask me that. So listen, Elastic came around, you know, we had Splunk, and we also have a lot of Splunk clients, but Elastic was a more cost-effective, open source solution to go after it.
But what happens, especially at scale, is this: at small scale it's actually very cost-effective, but underneath Elastic's tech, the ELK Stack, is a Lucene database, it's a database technology, and that sits on servers that are heavy in memory count, CPU count, and SSDs. You can do that on-prem or even in the cloud, so if you do it on Amazon, basically you're spinning up a server and it stays up; it doesn't spin up and spin down. And those clusters are not one server, it's a cluster of those servers. And typically, if you have any scale, you're actually running multiple clusters for different use cases, because you don't dare put it all on one. So our savings are that you no longer need those servers to spin up, and you don't need to pay for the Lucene underneath. You can still use Kibana and the API, but literally it's 80% off the bill that you're paying for your service now, and it's hard dollars. And we typically see clients between 70 and 80%. It's up to 80, but it's literally right within a 10% margin; you're saving a lot of money. But more importantly, saving money is a great thing, and now you have one unified data lake. You can have people go across some of the data or all the data through role-based access you give to different people. Like, we've seen people who say, hey, give that person 40 days of this data, but the SecOps team gets to see across all the different logs, you know, all the machine-generated data they have. And we can give you a couple of examples of that and walk you through how people deploy, if you want. >> I'm always keen to hear specific examples of how customers are doing things. And it's nice that you've drawn that comparison there around what cloud is good for and what it isn't. I often like to say that AWS is cheap to fail in, but expensive to succeed. So when people are actually succeeding with this and using this broad amount of data, what you're saying there with that savings is, I've actually got access to a lot more data that I can do things with. So yeah, if you could walk through a couple of examples of what people are doing with this increased amount of data that they have access to in ChaosSearch, what are some of the things that people are now able to unlock with that data? >> Well, it's good for customers of any size, so we can go through it however you might want: Kleiner, Blackboard, Alert Logic, Armor Security, HubSpot. Maybe I'll start with HubSpot, one of our good clients. They were working with Cloudflare data; that was one of the clusters they were using a lot to search, and they were looking at it to watch for denial of service. And, as we find with everyone at scale, they got limited, so they were down to five days' retention. Why? Well, it's not that they meant to, but basically they couldn't cost-effectively handle it at that scale. And they were also having scale issues with the environment, how they set up the cluster and sharding. And when they took a denial-of-service attack, what happened, and that's the thing about scale, one part is how fast the data comes at you and another is how much data you have, but as the data was coming after them in a denial of service, that's when the cluster would actually go down, believe it or not, you know, right when you need your log analysis tools. So what we did, because they're just using Kibana, it was an easy swap.
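To make that "easy swap" concrete, here is a minimal sketch of the idea: because the platform speaks the Elasticsearch API, an existing client can simply be pointed at a different endpoint. This uses the official Python Elasticsearch client; the endpoint URL, credentials, index name, and query are all hypothetical.

```python
# Minimal sketch of the "easy swap": the same Elasticsearch-client code,
# pointed at a different, Elasticsearch-compatible endpoint.
# The endpoint URL, credentials, and index name below are hypothetical.
from elasticsearch import Elasticsearch

# Before: a self-managed ELK cluster.
# es = Elasticsearch("https://elk.internal.example.com:9200")

# After: an Elasticsearch-compatible endpoint fronting data indexed in S3.
es = Elasticsearch(
    "https://search.example-endpoint.com",
    http_auth=("api_key_id", "api_key_secret"),  # hypothetical credentials
)

# The query itself does not change.
resp = es.search(
    index="cloudflare-logs",                     # hypothetical index name
    body={
        "query": {"match": {"action": "block"}},
        "size": 10,
    },
)
print(resp["hits"]["total"])
```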
They ran in parallel, because we publish the open API, and we took them from five days to nine days. They could keep as much as they want, but nine days for denial of service is what they wanted. And we saved them over $4 million a year in hard dollars. What they were paying in their environment, really, the savings came from the server farm and a little bit on the Elasticsearch stack. But more importantly, they've had no outages since. Now, here's the thing, talking about the use cases: they also had other clusters, and you find everyone does this, they don't dare put it all on one cluster, even though these are not one server, they're multiple servers. So Cloudflare was one use case; the next use case was a 10-terabyte-a-day influx, kept for 90 days, so it's about a petabyte. They brought another use case on, which was NetMon, network monitoring, and again, having the same scale and retention issues, they were able to easily roll that on. So that's one data platform, and now they're adding the next one. They have about four different use cases, just different clusters, that they're able to bring together. And what that gives you in these use cases is, they're getting more cost effective, more stability, and freedom. We say it saves you a lot of time, cost, and complexity: the time they spend managing it and getting the data in, the complexities around it, and then the cost is easy to kind of quantify. But more importantly, particular teams now only need access to their own data, while the SecOps team wants to see across all the data, and it's very easy for them to see across all the data, where before it was impossible to do. So now they have multiple large use cases streaming at them. And what I love about that particular case is, at one point they were just trying to test our scale. So they started tossing more things at it, right, to see if they could kind of break us. They spiked us up to 30 terabytes a day, where for Elastic, even 10 terabytes a day makes things fall over. Now, if you think of what they just did, it's literally three steps. Put your data in S3, as fast as you can, don't modify it, just put it there. Once it's there, connect to us: you give us read access to those buckets and a place to write the indices. All of that stuff is in your S3, it never comes out. And then basically you set up whether you want to do live, real-time analysis, or go after old data. We do the rest: we ingest, we normalize the schema, and basically we give you RBAC and the refinery to give the right people access. So what they did is they basically threw a whole bunch of stuff at it. They were trying to outrun S3. So, you know, we're on the shoulders of giants. If you think about our platform for clients, what's a better data lake than S3? You're not going to get a better cost curve, right? You're not going to get better parallelism. And for security, it's in your, you know, virtual environment. And also, you can keep data in the right location. Blackboard's a good example. They need to keep data in all the different regions, and because it's personal data, you know, GDPR, they've got to keep data in that location. It's easy, we just put compute in each one of the different areas they are.
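A minimal sketch of the AWS-side permissions those three steps imply: read-only access to the source log buckets, plus write access to a prefix where the index is kept. The bucket names, prefix, and policy name are hypothetical, and the exact permissions any given service needs may differ.

```python
# Minimal sketch of the AWS-side permissions the three steps imply:
# read-only access to the source log buckets, plus write access to a
# prefix where the index is stored. All names here are hypothetical.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read the raw log data in place
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-log-bucket",
                "arn:aws:s3:::example-log-bucket/*",
            ],
        },
        {   # write index files under a dedicated prefix
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-index-bucket",
                "arn:aws:s3:::example-index-bucket/index/*",
            ],
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="example-log-index-access",       # hypothetical name
    PolicyDocument=json.dumps(policy),
)
```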
But the net-net is, if you think about that architecture, it's on the shoulders of giants. If you think you can outrun it by sheer volume, or find a more cost-effective place to keep data long term, or out-store it, that you have so much data that S3 and Glacier can't possibly handle it, then you've got bigger scale than me, but that's the scale we're talking about. So when they spiked our throughput, what they really did is they tried to outrun S3, and it didn't hiccup. Now, the next thing is they tossed a bunch of users at us, which just spun up, in our data fabric, different ways to do the indexing to keep up with it, and new use cases; everyone gets their own worker nodes, which are all expected to fail in place. So again, they did some of that, but really they said, you guys handled all the influx. And if you think about it, it's the shoulders of giants, being on top of an Amazon platform, which is amazing. You're not going to get a more cost-effective data lake in the world, and it's continuing to fall in price, and it's a cost curve like no other, but also all that resiliency, all that security, and the parallelism you can get out of S3 and Glacier: it is, bar none, the most scalable environment you can build on. And what we do is a thin layer, a data platform that allows you to have your data fully searchable and queryable using your tools. >> Right, and you mentioned there that, I mean, you're running in AWS, which has broad experience in doing these sorts of things at scale, but on the operational management side of things, as you mentioned, you actually take that off the hands of customers and run it on their behalf. What are some of the mistakes that you see people making in trying to do this themselves, when you've gone into customers and brought them onto the ChaosSearch platform? >> Yeah, so either people are just trying their best to build out clusters of Elasticsearch, or they're going to services like Logz.io, Sumo Logic, or Amazon Elasticsearch Service, and those are all basically on the same ELK Stack, so they have the exact same limits, the same bits. Then we see people trying to say, well, I really want to go to a data lake, I want to get away from these database servers, which have their limits, I want to use a data lake. And then we see a lot of people putting data into environments where, instead of using Elasticsearch, they want to use SQL-type tools, and what they do is they put it into a Parquet or Presto form, a Presto dialect, structuring it into Parquet. And they go a long way to say, hey, it's in the data lake, but they end up building these little islands inside their data lake, and it's a lot of time to transform the data to get it into a format that your tools can go after. What we do is, we don't make you do that. Just literally put the data there, and then we do the indexing and publish the API. So right now it's Elasticsearch, and in a very short time we'll publish Presto, or the SQL dialect, and you can use the same tools. So we do see people either brute-forcing it, trying their best with a bunch of physical servers, and we see another group that says, you know, I want to go use Athena-type use cases, or there's a whole bunch of different startups saying, I do data lakes or data lakehouses, but what they really do is force you to put things in a structure before you get insight.
True data lake economics is literally, just put it there and use your tools natively to go after it. And that's where we're unique compared to what we see from our competition. >> Hmm, so with people who have moved onto ChaosSearch, what's, let's say pick one if you can, the most interesting example of what people have started to do with their data? What's new? >> That's good. Well, I'll give you another one. Armor Security is a good one. Armor Security is a security services company, you know, thousands of clients, doing great, I mean, a beautiful platform, a beautiful business. And they won Rackspace as a partner. So now imagine thousands of clients, and now, you know, massive scale to keep up with. So that would be an example where we were able to come in when they were facing a major upgrade of their environment just to keep up, and what they actually expose to their customers is how their customers do logging analytics. What we were able to do, literally, simply because they didn't go below the API, is let them use the exact same tools on top, and in 30 days we replaced that use case and saved them a tremendous amount of dollars. But now they're able to go back and have unlimited retention. They used to restrict their clients to 14 days; now they have an opportunity to do a bunch of different things, possible revenue opportunities and others, and it allows them to look at their business differently and free up their team to do other things. And now they're putting billing and other things into the same environment with us, because one, it's easy, it scales, but it also freed up their team; no one has enough team to do things. And then the biggest thing is, the interesting things people do with our product are actually in their own tools. So, you know, we talk about Kibana; when we do SQL, again, we talk about Looker and Tableau and Power BI. You know, the really interesting thing is, we think we did the hard work on the data layer, which you can say is all the ways you consolidate the data and the performance. Now what becomes really interesting is what they're doing at the visibility level, whether Kibana or the API or Tableau or Looker. And the key thing for us is, we just say, use the tools you're used to. Now, that might be a boring statement, but to me, a great value proposition is not changing what your end users have to use. And they're doing amazing things. They're doing the exact same things they did before, just with more data at bigger scale. And they're also able to see across their different machine-generated data, compared to being limited to going after one thing at a time. And getting that correlation from a unified data lake is really what we get very excited about. What's most exciting to our clients is they don't have to tell the users to use a different tool, and, you know, you'll decide if that's really interesting in this conversation. But again, I always say, we didn't build a new algorithm that you're going to give the SecOps team, or a new cool pipeline widget that's going to help the machine learning team, which is another API we'll publish. Basically, what we do is the hard work of making the data platform scalable, but more importantly, give you the APIs that you're used to. So it's a platform where you don't have to change what your end users are doing. We're kind of invisible behind the scenes.
>> Well, that's certainly a pretty strong proposition there, and I'm sure that there's plenty of scope for customers to come and talk to you, because no one's creating any less data. So Ed, thanks for coming on theCUBE. It's always great to see you here. >> No, thank you. >> You've been watching theCUBE Virtual and our coverage of AWS re:Invent 2020 with special coverage of the APN partner experience. Make sure you check out all our coverage online, either on your desktop or mobile on your phone, wherever you are. I've been your host, Justin Warren, and I look forward to seeing you again soon. (soft music)
SUMMARY :
the globe it's theCUBE, and our coverage of AWS re:Invent 2020 Well thank you for having me, loved it to have you back. and the time to manage these clusters, be able to save up to 80%. And we can give you a So yeah, if you could walk and the parallelism you can get, that you see people making it's in the data lake, but they end up what's, let's say pick one, if you can, I can about all the ways you It's always great to see you here. And I look forward to
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Justin Warren | PERSON | 0.99+ |
Ed Walsh | PERSON | 0.99+ |
$80 | QUANTITY | 0.99+ |
40 days | QUANTITY | 0.99+ |
five days | QUANTITY | 0.99+ |
Ed Walsh | PERSON | 0.99+ |
90 days | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
AWS Global Partner Network | ORGANIZATION | 0.99+ |
nine days | QUANTITY | 0.99+ |
80% | QUANTITY | 0.99+ |
10 terabytes | QUANTITY | 0.99+ |
thousands | QUANTITY | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
HubSpot | ORGANIZATION | 0.99+ |
Ed | PERSON | 0.99+ |
10% | QUANTITY | 0.99+ |
Elasticsearch | TITLE | 0.99+ |
30 days | QUANTITY | 0.99+ |
Armor Security | ORGANIZATION | 0.99+ |
14 days | QUANTITY | 0.99+ |
thousand clients | QUANTITY | 0.99+ |
Blackboard | ORGANIZATION | 0.99+ |
Kleiner | ORGANIZATION | 0.99+ |
S3 | TITLE | 0.99+ |
One | QUANTITY | 0.99+ |
Alert Logic | ORGANIZATION | 0.99+ |
three steps | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
GDPR | TITLE | 0.98+ |
one thing | QUANTITY | 0.98+ |
one data | QUANTITY | 0.98+ |
one server | QUANTITY | 0.98+ |
Elastic | TITLE | 0.98+ |
70 | QUANTITY | 0.98+ |
SQL | TITLE | 0.98+ |
about 80% | QUANTITY | 0.97+ |
Kibana | TITLE | 0.97+ |
first time | QUANTITY | 0.97+ |
over $4 million a year | QUANTITY | 0.97+ |
one cluster | QUANTITY | 0.97+ |
first person | QUANTITY | 0.97+ |
CloudFlare | TITLE | 0.97+ |
ChaosSearch | ORGANIZATION | 0.97+ |
this year | DATE | 0.97+ |
Glacier | TITLE | 0.97+ |
up to 80% | QUANTITY | 0.97+ |
Parquet | TITLE | 0.96+ |
each one | QUANTITY | 0.95+ |
Splunk | ORGANIZATION | 0.95+ |
Sumo Logic | ORGANIZATION | 0.94+ |
up to 80 | QUANTITY | 0.94+ |
Power BI | TITLE | 0.93+ |
today | DATE | 0.93+ |
Rackspace | ORGANIZATION | 0.92+ |
up to 30 terabytes a day | QUANTITY | 0.92+ |
one point | QUANTITY | 0.91+ |
S3 Glacier | COMMERCIAL_ITEM | 0.91+ |
Elastic API | TITLE | 0.89+ |
End-to-End Security in Vertica | Vertica BDC 2020
>> Paige: Hello everybody and thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled End-to-End Security in Vertica. I'm Paige Roberts, Open Source Relations Manager at Vertica. I'll be your host for this session. Joining me is Vertica Software Engineers, Fenic Fawkes and Chris Morris. Before we begin, I encourage you to submit your questions or comments during the virtual session. You don't have to wait until the end. Just type your question or comment in the question box below the slide as it occurs to you and click submit. There will be a Q&A session at the end of the presentation and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Also, you can visit Vertica forums to post your questions there after the session. Our team is planning to join the forums to keep the conversation going, so it'll be just like being at a conference and talking to the engineers after the presentation. Also, a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And before you ask, yes, this whole session is being recorded and it will be available to view on-demand this week. We'll send you a notification as soon as it's ready. I think we're ready to get started. Over to you, Fen. >> Fenic: Hi, welcome everyone. My name is Fen. My pronouns are fae/faer and Chris will be presenting the second half, and his pronouns are he/him. So to get started, let's kind of go over what the goals of this presentation are. First off, no deployment is the same. So we can't give you an exact, like, here's the right way to secure Vertica because how it is to set up a deployment is a factor. But the biggest one is, what is your threat model? So, if you don't know what a threat model is, let's take an example. We're all working from home because of the coronavirus and that introduces certain new risks. Our source code is on our laptops at home, that kind of thing. But really our threat model isn't that people will read our code and copy it, like, over our shoulders. So we've encrypted our hard disks and that kind of thing to make sure that no one can get them. So basically, what we're going to give you are building blocks and you can pick and choose the pieces that you need to secure your Vertica deployment. We hope that this gives you a good foundation for how to secure Vertica. And now, what we're going to talk about. So we're going to start off by going over encryption, just how to secure your data from attackers. And then authentication, which is kind of how to log in. Identity, which is who are you? Authorization, which is now that we know who you are, what can you do? Delegation is about how Vertica talks to other systems. And then auditing and monitoring. So, how do you protect your data in transit? Vertica makes a lot of network connections. Here are the important ones basically. There are clients talk to Vertica cluster. Vertica cluster talks to itself. And it can also talk to other Vertica clusters and it can make connections to a bunch of external services. So first off, let's talk about client-server TLS. Securing data between, this is how you secure data between Vertica and clients. It prevents an attacker from sniffing network traffic and say, picking out sensitive data. Clients have a way to configure how strict the authentication is of the server cert. 
It's called the Client SSLMode and we'll talk about this more in a bit but authentication methods can disable non-TLS connections, which is a pretty cool feature. Okay, so Vertica also makes a lot of network connections within itself. So if Vertica is running behind a strict firewall, you have really good network, both physical and software security, then it's probably not super important that you encrypt all traffic between nodes. But if you're on a public cloud, you can set up AWS' firewall to prevent connections, but if there's a vulnerability in that, then your data's all totally vulnerable. So it's a good idea to set up inter-node encryption in less secure situations. Next, import/export is a good way to move data between clusters. So for instance, say you have an on-premises cluster and you're looking to move to AWS. Import/Export is a great way to move your data from your on-prem cluster to AWS, but that means that the data is going over the open internet. And that is another case where an attacker could try to sniff network traffic and pull out credit card numbers or whatever you have stored in Vertica that's sensitive. So it's a good idea to secure data in that case. And then we also connect to a lot of external services. Kafka, Hadoop, S3 are three of them. Voltage SecureData, which we'll talk about more in a sec, is another. And because of how each service deals with authentication, how to configure your authentication to them differs. So, see our docs. And then I'd like to talk a little bit about where we're going next. Our main goal at this point is making Vertica easier to use. Our first objective was security, was to make sure everything could be secure, so we built relatively low-level building blocks. Now that we've done that, we can identify common use cases and automate them. And that's where our attention is going. Okay, so we've talked about how to secure your data over the network, but what about when it's on disk? There are several different encryption approaches, each depends on kind of what your use case is. RAID controllers and disk encryption are mostly for on-prem clusters and they protect against media theft. They're invisible to Vertica. S3 and GCP are kind of the equivalent in the cloud. They also invisible to Vertica. And then there's field-level encryption, which we accomplish using Voltage SecureData, which is format-preserving encryption. So how does Voltage work? Well, it, the, yeah. It encrypts values to things that look like the same format. So for instance, you can see date of birth encrypted to something that looks like a date of birth but it is not in fact the same thing. You could do cool stuff like with a credit card number, you can encrypt only the first 12 digits, allowing the user to, you know, validate the last four. The benefits of format-preserving encryption are that it doesn't increase database size, you don't need to alter your schema or anything. And because of referential integrity, it means that you can do analytics without unencrypting the data. So again, a little diagram of how you could work Voltage into your use case. And you could even work with Vertica's row and column access policies, which Chris will talk about a bit later, for even more customized access control. Depending on your use case and your Voltage integration. We are enhancing our Voltage integration in several ways in 10.0 and if you're interested in Voltage, you can go see their virtual BDC talk. 
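Circling back to the client-server TLS discussion at the top of this section, here is a minimal sketch of a client connection that refuses to fall back to plaintext, using the vertica-python client. The hostnames and credentials are hypothetical, and the exact option names should be checked against the client documentation for your version.

```python
# Minimal sketch: a client connection that requires TLS and verifies the
# server certificate, using vertica-python. Host, user, and password are
# hypothetical; check option names against your client version's docs.
import ssl
import vertica_python

# Verify the server certificate against a CA bundle we trust.
tls_context = ssl.create_default_context(cafile="/etc/ssl/certs/vertica_ca.pem")
tls_context.check_hostname = True
tls_context.verify_mode = ssl.CERT_REQUIRED

conn_info = {
    "host": "vertica.example.com",   # hypothetical host
    "port": 5433,
    "user": "alice",                 # hypothetical user
    "password": "...",               # supplied securely in practice
    "database": "analytics",
    # Passing an SSLContext (rather than a plain True/False) both enables
    # TLS and enforces certificate verification on the client side.
    "ssl": tls_context,
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("SELECT version();")
    print(cur.fetchone())
```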
And then again, talking about roadmap a little, we're working on in-database encryption at rest. What this means is kind of a Vertica solution to encryption at rest that doesn't depend on the platform that you're running on. Encryption at rest is hard. (laughs) Encrypting, say, 10 petabytes of data is a lot of work. And once again, the theme of this talk is everyone has a different key management strategy, a different threat model, so we're working on designing a solution that fits everyone. If you're interested, we'd love to hear from you. Contact us on the Vertica forums. All right, next up we're going to talk a little bit about access control. So first off is how do I prove who I am? How do I log in? So, Vertica has several authentication methods. Which one is best depends on your deployment size/use case. Again, theme of this talk is what you should use depends on your use case. You could order authentication methods by priority and origin. So for instance, you can only allow connections from within your internal network or you can enforce TLS on connections from external networks but relax that for connections from your internal network. That kind of thing. So we have a bunch of built-in authentication methods. They're all password-based. User profiles allow you to set complexity requirements of passwords and you can even reject non-TLS connections, say, or reject certain kinds of connections. Should only be used by small deployments because you probably have an LDAP server, where you manage users if you're a larger deployment and rather than duplicating passwords and users all in LDAP, you should use LDAP Auth, where Vertica still has to keep track of users, but each user can then use LDAP authentication. So Vertica doesn't store the password at all. The client gives Vertica a username and password and Vertica then asks the LDAP server is this a correct username or password. And the benefits of this are, well, manyfold, but if, say, you delete a user from LDAP, you don't need to remember to also delete their Vertica credentials. You can just, they won't be able to log in anymore because they're not in LDAP anymore. If you like LDAP but you want something a little bit more secure, Kerberos is a good idea. So similar to LDAP, Vertica doesn't keep track of who's allowed to log in, it just keeps track of the Kerberos credentials and it even, Vertica never touches the user's password. Users log in to Kerberos and then they pass Vertica a ticket that says "I can log in." It is more complex to set up, so if you're just getting started with security, LDAP is probably a better option. But Kerberos is, again, a little bit more secure. If you're looking for something that, you know, works well for applications, certificate auth is probably what you want. Rather than hardcoding a password, or storing a password in a script that you use to run an application, you can instead use a certificate. So, if you ever need to change it, you can just replace the certificate on disk and the next time the application starts, it just picks that up and logs in. Yeah. And then, multi-factor auth is a feature request we've gotten in the past and it's not built-in to Vertica but you can do it using Kerberos. So, security is a whole application concern and fitting MFA into your workflow is all about fitting it in at the right layer. And we believe that that layer is above Vertica. If you're interested in more about how MFA works and how to set it up, we wrote a blog on how to do it. 
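Before the hand-off, a rough sketch of what the LDAP authentication setup described above can look like, issued here through the vertica-python client as a superuser. The LDAP URL, bind DN pieces, and network range are hypothetical, and the parameter names should be verified against the Vertica documentation for your release.

```python
# Rough sketch of configuring LDAP authentication for client connections,
# run as a superuser through vertica-python. The LDAP URL, bind DN parts,
# and address range are hypothetical; check parameter names in the docs.
import vertica_python

admin_conn = {
    "host": "vertica.example.com",   # hypothetical host
    "port": 5433,
    "user": "dbadmin",
    "password": "...",
    "database": "analytics",
}

statements = [
    # Create an authentication record that applies to client connections
    # coming from the internal network.
    "CREATE AUTHENTICATION v_ldap METHOD 'ldap' HOST '10.0.0.0/8'",
    # Point it at the LDAP server; Vertica passes the credentials through
    # and never stores the user's password itself.
    "ALTER AUTHENTICATION v_ldap SET host='ldap://ldap.example.com', "
    "binddn_prefix='uid=', binddn_suffix=',ou=users,dc=example,dc=com'",
    # Make the method available to everyone; priority ordering can refine this.
    "GRANT AUTHENTICATION v_ldap TO PUBLIC",
]

with vertica_python.connect(**admin_conn) as conn:
    cur = conn.cursor()
    for stmt in statements:
        cur.execute(stmt)
```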
And now, over to Chris, for more on identity and authorization. >> Chris: Thanks, Fen. Hi everyone, I'm Chris. So, we're a Vertica user and we've connected to Vertica but once we're in the database, who are we? What are we? So in Vertica, the answer to that questions is principals. Users and roles, which are like groups in other systems. Since roles can be enabled and disabled at will and multiple roles can be active, they're a flexible way to use only the privileges you need in the moment. For example here, you've got Alice who has Dbadmin as a role and those are some elevated privileges. She probably doesn't want them active all the time, so she can set the role and add them to her identity set. All of this information is stored in the catalog, which is basically Vertica's metadata storage. How do we manage these principals? Well, depends on your use case, right? So, if you're a small organization or maybe only some people or services need Vertica access, the solution is just to manage it with Vertica. You can see some commands here that will let you do that. But what if we're a big organization and we want Vertica to reflect what's in our centralized user management system? Sort of a similar motivating use case for LDAP authentication, right? We want to avoid duplication hassles, we just want to centralize our management. In that case, we can use Vertica's LDAPLink feature. So with LDAPLink, principals are mirrored from LDAP. They're synced in a considerable fashion from the LDAP into Vertica's catalog. What this does is it manages creating and dropping users and roles for you and then mapping the users to the roles. Once that's done, you can do any Vertica-specific configuration on the Vertica side. It's important to note that principals created in Vertica this way, support multiple forms of authentication, not just LDAP. This is a separate feature from LDAP authentication and if you created a user via LDAPLink, you could have them use a different form of authentication, Kerberos, for example. Up to you. Now of course this kind of system is pretty mission-critical, right? You want to make sure you get the right roles and the right users and the right mappings in Vertica. So you probably want to test it. And for that, we've got new and improved dry run functionality, from 9.3.1. And what this feature offers you is new metafunctions that let you test various parameters without breaking your real LDAPLink configuration. So you can mess around with parameters and the configuration as much as you want and you can be sure that all of that is strictly isolated from the live system. Everything's separated. And when you use this, you get some really nice output through a Data Collector table. You can see some example output here. It runs the same logic as the real LDAPLink and provides detailed information about what would happen. You can check the documentation for specifics. All right, so we've connected to the database, we know who we are, but now, what can we do? So for any given action, you want to control who can do that, right? So what's the question you have to ask? Sometimes the question is just who are you? It's a simple yes or no question. For example, if I want to upgrade a user, the question I have to ask is, am I the superuser? If I'm the superuser, I can do it, if I'm not, I can't. But sometimes the actions are more complex and the question you have to ask is more complex. Does the principal have the required privileges? 
If you're familiar with SQL privileges, there are things like SELECT, INSERT, and Vertica has a few of its own, but the key thing here is that an action can require specific and maybe even multiple privileges on multiple objects. So for example, when selecting from a table, you need USAGE on the schema and SELECT on the table. And there are some other examples here. So where do these privileges come from? Well, if the action requires a privilege, these are the only places privileges can come from. The first source is implicit privileges, which could come from owning the object or from special roles, which we'll talk about in a sec. Explicit privileges are basically the SQL standard GRANT system. So you can grant privileges to users or roles, and optionally, those users and roles could grant them downstream. Discretionary access control. So those are explicit, and they come from the user and the active roles, so the whole identity set. And then we've got Vertica-specific inherited privileges, and those come from the schema, and we'll talk about that in a sec as well. So these are the special roles in Vertica. First role, DBADMIN. This isn't the dbadmin user, it's a role. And it has specific elevated privileges. You can check the documentation for those exact privileges, but it's less than the superuser. The PSEUDOSUPERUSER can do anything the real superuser can do, and you can grant this role to whomever. The DBDUSER is also a role; it can run Database Designer functions. SYSMONITOR gives you some elevated auditing permissions, and we'll talk about that later as well. And finally, PUBLIC is a role that everyone has all the time, so anything you want to be allowed for everyone, attach to PUBLIC. Imagine this scenario. I've got a really big schema with lots of relations. Those relations might be changing all the time. But for each principal that uses this schema, I want the privileges for all the tables and views there to be roughly the same. Even though the tables and views come and go, an analyst, for example, might need full access to all of them, no matter how many there are or what they are at any given time. So to manage this, the first approach I could use is to remember to run grants every time a new table or view is created. And not just me, but everyone using this schema. Not only is it a pain, it's hard to enforce. The second approach is to use schema-inherited privileges. So in Vertica, schema grants can include relational privileges, for example, SELECT or INSERT, which normally don't mean anything for a schema, but they do for a table. If a relation's marked as inheriting, then the schema grants to a principal, for example, salespeople, also apply to the relation. And you can see on the diagram here how USAGE and SELECT are technically granted on the schema, but SELECT also applies to the Sales.foo table. So now, instead of lots of GRANT statements from multiple object owners, we only have to run one ALTER SCHEMA statement and three GRANT statements, and from then on, any time you grant or revoke privileges on the schema to or from a principal, all your new tables and views will get them automatically. So it's dynamically calculated. Now of course, part of setting this up securely is knowing what's happened here and what's going on. So to monitor the privileges, there are three system tables which you want to look at. The first is grants, which will show you privileges that are active for you.
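As a sketch of how that one ALTER SCHEMA statement and a few GRANTs might look, using a hypothetical sales schema and salespeople role (the names are illustrative only, and syntax details can vary by version):

```sql
CREATE ROLE salespeople;

-- Grant relational privileges at the schema level and let new relations inherit them.
GRANT USAGE, SELECT ON SCHEMA sales TO salespeople;
ALTER SCHEMA sales DEFAULT INCLUDE SCHEMA PRIVILEGES;

-- Relations created before the change can be switched over explicitly.
ALTER TABLE sales.foo INCLUDE SCHEMA PRIVILEGES;

-- From here on, new tables and views in sales are readable by salespeople
-- without any further GRANT statements.
```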
The grants table shows privileges that are active for you; that is, your user and active roles and theirs, and so on down the chain. Grants will show you the explicit privileges, and inherited_privileges will show you the inherited ones. And then there's one more, inheriting_objects, which will show all tables and views that inherit privileges, so that's useful not so much for seeing privileges themselves as for managing inherited privileges in general. And finally, how do you see all privileges from all these sources, right? In one go, you want to see them together? Well, there's a metafunction added in 9.3.1, GET_PRIVILEGES_DESCRIPTION, which, given an object, will sum up all the privileges the current user has on that object. I'll refer you to the documentation for usage and supported types. Now, the problem with SELECT. SELECT lets you see everything or nothing. You can either read the table or you can't. But what if you want some principals to see a subset or a transformed version of the data? So for example, I have a table with personnel data, and different principals, as you can see here, need different access levels to sensitive information like social security numbers. Well, one thing I could do is make a view for each principal. But I could also use access policies, and access policies can do this without introducing any new objects or dependencies. It centralizes your restriction logic and makes it easier to manage. So what do access policies do? Well, we've got row and column access policies. Row access policies will hide rows, and column access policies will transform data in the column, depending on who's doing the SELECTing. So it transforms the data, as we saw on the previous slide, to look as requested. Now, only if access policies let you see the raw data can you still modify the data. And the implication of this is that when you're crafting access policies, you should only use them to refine access for principals that need read-only access. That is, if you want a principal to be able to modify the data, the access policies you craft should let through the raw data for that principal. So in our previous example, the loader service should be able to see every row, and it should be able to see untransformed data in every column. And as long as that's true, then they can continue to load into this table. All of this is of course monitorable by a system table, in this case access_policy. Check the docs for more information on how to implement these. All right, that's it for access control. Now on to delegation and impersonation. So what's the question here? Well, the question is, who is Vertica? And that might seem like a silly question, but here's what I mean by that. When Vertica's connecting to a downstream service, for example, cloud storage, how should Vertica identify itself? Well, most of the time, we do the permissions check ourselves and then we connect as Vertica, like in this diagram here. But sometimes we can do better. And instead of connecting as Vertica, we connect with some kind of upstream user identity. And when we do that, we let the service decide who can do what, so Vertica isn't the only line of defense. And in addition to the defense-in-depth benefit, there are also benefits for auditing, because the external system can see who is really doing something. It's no longer just Vertica showing up in that external service's logs, it's somebody like Alice or Bob trying to do something. One system where this comes into play is with Voltage SecureData. So, let's look at a couple of use cases.
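Before those use cases, here is a minimal sketch of the row and column access policies just described, on a hypothetical personnel table. The column names and roles are placeholders, and a loader principal would need policies that pass through the raw data, as noted above:

```sql
-- Column policy: the hr role sees raw social security numbers, everyone else a mask.
CREATE ACCESS POLICY ON personnel FOR COLUMN ssn
    CASE WHEN ENABLED_ROLE('hr') THEN ssn ELSE '***-**-****' END
ENABLE;

-- Row policy: managers see all rows, other users see only their own record.
CREATE ACCESS POLICY ON personnel FOR ROWS
    WHERE ENABLED_ROLE('manager') OR employee_login = CURRENT_USER()
ENABLE;

-- Monitor what's in place.
SELECT * FROM access_policy;
```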
The first Voltage use case: I'm just encrypting for compliance or anti-theft reasons. In this case, I'll just use one global identity to encrypt or decrypt with Voltage. But imagine another use case: I want to control which users can decrypt which data. Now I'm using Voltage for access control. So in this case, we want to delegate. The solution here is, on the Voltage side, give Voltage users access to appropriate identities, and these identities control encryption for sets of data. A Voltage user can access multiple identities, like groups. Then on the Vertica side, a Vertica user can set their Voltage username and password in a session, and Vertica will talk to Voltage as that Voltage user. So in the diagram here, you can see an example of how this is leveraged so that Alice can decrypt something but Bob cannot. Another place the delegation paradigm shows up is with storage. So Vertica can store and interact with data on non-local file systems, for example, HDFS or S3. Sometimes Vertica's storing Vertica-managed data there. For example, in Eon Mode, you might store your projections in communal storage in S3. But sometimes, Vertica is interacting with external data. This usually maps to a user storage location on the Vertica side, and it might, on the external storage side, be something like Parquet files on Hadoop. And in that case, it's not really Vertica's data, and we don't want to give Vertica more power than it needs, so let's request the data on behalf of who needs it. Let's say I'm an analyst and I want to copy from or export to Parquet, using my own bucket. It's not Vertica's bucket, it's my data. But I want Vertica to manipulate data in it. So the first option I have is to give Vertica as a whole access to the bucket, and that's problematic, because in that case, Vertica becomes kind of an AWS god. It can see any bucket that any Vertica user might want to push or pull data to or from, any time Vertica wants. So it's not good for the principles of least access and zero trust. And we can do better than that. So in the second option, use an ID and secret key pair for an AWS IAM principal, if you're familiar, that does have access to the bucket. So I might use my credentials, the analyst's, or I might use credentials for an AWS role that has even fewer privileges than I do. Sort of a restricted subset of my privileges. And then I use that. I set it in Vertica at the session level, and Vertica will use those credentials for the COPY and EXPORT commands. And it gives more isolation. Something that's in the works is support for keyless delegation, using assumable IAM roles. So similar benefits to option two here, but also not having to manage keys at the user level. We can do basically the same thing with Hadoop and HDFS, with three different methods. So the first option is Kerberos delegation. I think it's the most secure. If access control is your primary concern here, this will definitely give you the tightest access control. The downside is it requires the most configuration outside of Vertica, with Kerberos and HDFS, but with this, you can really determine which Vertica users can talk to which HDFS locations. Then, you've got secure impersonation. If you've got a highly trusted Vertica userbase, or at least some subset of it, and you're not worried about them doing things wrong, but auditing on the HDFS side is your primary concern, you can use this option. This diagram here gives you a visual overview of how that works, but I'll refer you to the docs for details.
And then finally, option three: this is bringing your own delegation token. It's similar to what we do with AWS. We set something at the session level, so it's very flexible. The user can do it on an ad hoc basis, but it is manual. So that's the third option. Now on to auditing and monitoring. So of course, we want to know, what's happening in our database? It's important in general, and important for incident response, of course. So your first stop to answer this question should be system tables. They're a collection of information about events, system state, performance, et cetera. They're SELECT-only tables, but they work in queries as usual; the data is just loaded differently. There are two types, generally. There are metadata tables, which reflect persistent information stored in the catalog, for example, users or schemata. Then there are monitoring tables, which reflect more transient information, like events and system resources. Here you can see an example of output from the resource pools system table. Despite looking like system statistics, these are actually configurable parameters for resource pools. If you're interested in resource pools, a way to handle various principals' resource allocation, again, check that out in the docs. Then of course, there's the follow-up question: who can see all of this? Well, some system information is sensitive, and we should only show it to those who need it. Principle of least privilege, right? So of course the superuser can see everything, but what about non-superusers? How do we give access to people that might need additional information about the system without giving them too much power? One option's SYSMONITOR. As I mentioned before, it's a special role, and this role can always read system tables, but not change things like a superuser would be able to. Just reading. And another option is the RESTRICT and RELEASE metafunctions. Those grant and revoke access to a certain set of system tables, to and from the PUBLIC role. But the downside of those approaches is that they're inflexible. They're all or nothing, for a specific preset of tables, and you can't really configure them per table. So if you're willing to do a little more setup, then I'd recommend using your own grants and roles. System tables support GRANT and REVOKE statements just like any regular relations, and in that case, I wouldn't even bother with SYSMONITOR or the metafunctions. So to do this, just grant whatever privileges you see fit to roles that you create. Then go ahead and grant those roles to the users that you want, and revoke access to the system tables of your choice from PUBLIC. If you need even finer-grained access than this, you can create views on top of system tables. For example, you can create a view on top of the users system table which only shows the current user's information, using a built-in function as part of the view definition. And then, you can actually grant this view to PUBLIC, so that each user in Vertica can see their own user's information, and never give access to the users system table as a whole, just that view.
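A minimal sketch of that approach, with a hypothetical monitoring role and a per-user view over the users system table (table and column names should be checked against your version's documentation):

```sql
-- A custom monitoring role instead of the all-or-nothing options.
CREATE ROLE session_monitor;
GRANT SELECT ON v_monitor.sessions TO session_monitor;
REVOKE SELECT ON v_monitor.sessions FROM PUBLIC;
GRANT session_monitor TO alice;

-- A view that shows each user only their own row from the users system table.
CREATE VIEW my_user_info AS
    SELECT * FROM v_catalog.users WHERE user_name = CURRENT_USER();
GRANT SELECT ON my_user_info TO PUBLIC;
```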
Now, if you're a superuser, or if you have direct access to nodes in the cluster, filesystem, OS, et cetera, then you have more ways to see events. Vertica supports various methods of logging. You can see a few methods here which generally sit outside of running Vertica; you'd interact with them in a different way, with the exception of active events, which is a system table. We've also got the data collector, which sorts events by subject, or component, as it's called in the documentation. It extends the logging and system table functionality and logs these events and information to rotating files. For example, AnalyzeStatistics is a function that users might run, and as a database administrator, you might want to monitor that, so you can use the data collector component for AnalyzeStatistics. The files that these create can be exported into a monitoring database. One example of that is the Management Console Extended Monitoring, so check out the virtual BDC talk on the Management Console. And that's it for the key points of security in Vertica. Many of these slides could spawn a talk of their own, so we encourage you to check out our blog, the documentation, and the forum for further investigation and collaboration. Hopefully the information we provided today will inform your choices in securing your deployment of Vertica. Thanks for your time today. That concludes our presentation. Now, we're ready for Q&A.
Vertica Big Data Conference Keynote
>> Joy: Welcome to the Virtual Big Data Conference. Vertica is so excited to host this event. I'm Joy King, and I'll be your host for today's Big Data Conference Keynote Session. It's my honor and my genuine pleasure to lead Vertica's product and go-to-market strategy. And I'm so lucky to have a passionate and committed team who turned our Vertica BDC event, into a virtual event in a very short amount of time. I want to thank the thousands of people, and yes, that's our true number who have registered to attend this virtual event. We were determined to balance your health, safety and your peace of mind with the excitement of the Vertica BDC. This is a very unique event. Because as I hope you all know, we focus on engineering and architecture, best practice sharing and customer stories that will educate and inspire everyone. I also want to thank our top sponsors for the virtual BDC, Arrow, and Pure Storage. Our partnerships are so important to us and to everyone in the audience. Because together, we get things done faster and better. Now for today's keynote, you'll hear from three very important and energizing speakers. First, Colin Mahony, our SVP and General Manager for Vertica, will talk about the market trends that Vertica is betting on to win for our customers. And he'll share the exciting news about our Vertica 10 announcement and how this will benefit our customers. Then you'll hear from Amy Fowler, VP of strategy and solutions for FlashBlade at Pure Storage. Our partnership with Pure Storage is truly unique in the industry, because together modern infrastructure from Pure powers modern analytics from Vertica. And then you'll hear from John Yovanovich, Director of IT at AT&T, who will tell you about the Pure Vertica Symphony that plays live every day at AT&T. Here we go, Colin, over to you. >> Colin: Well, thanks a lot joy. And, I want to echo Joy's thanks to our sponsors, and so many of you who have helped make this happen. This is not an easy time for anyone. We were certainly looking forward to getting together in person in Boston during the Vertica Big Data Conference and Winning with Data. But I think all of you and our team have done a great job, scrambling and putting together a terrific virtual event. So really appreciate your time. I also want to remind people that we will make both the slides and the full recording available after this. So for any of those who weren't able to join live, that is still going to be available. Well, things have been pretty exciting here. And in the analytic space in general, certainly for Vertica, there's a lot happening. There are a lot of problems to solve, a lot of opportunities to make things better, and a lot of data that can really make every business stronger, more efficient, and frankly, more differentiated. For Vertica, though, we know that focusing on the challenges that we can directly address with our platform, and our people, and where we can actually make the biggest difference is where we ought to be putting our energy and our resources. I think one of the things that has made Vertica so strong over the years is our ability to focus on those areas where we can make a great difference. So for us as we look at the market, and we look at where we play, there are really three recent and some not so recent, but certainly picking up a lot of the market trends that have become critical for every industry that wants to Win Big With Data. We've heard this loud and clear from our customers and from the analysts that cover the market. 
If I were to summarize these three areas, this really is the core focus for us right now. We know that there's massive data growth. And if we can unify the data silos so that people can really take advantage of that data, we can make a huge difference. We know that public clouds offer tremendous advantages, but we also know that balance and flexibility is critical. And we all need the benefit that machine learning for all the types up to the end data science. We all need the benefits that they can bring to every single use case, but only if it can really be operationalized at scale, accurate and in real time. And the power of Vertica is, of course, how we're able to bring so many of these things together. Let me talk a little bit more about some of these trends. So one of the first industry trends that we've all been following probably now for over the last decade, is Hadoop and specifically HDFS. So many companies have invested, time, money, more importantly, people in leveraging the opportunity that HDFS brought to the market. HDFS is really part of a much broader storage disruption that we'll talk a little bit more about, more broadly than HDFS. But HDFS itself was really designed for petabytes of data, leveraging low cost commodity hardware and the ability to capture a wide variety of data formats, from a wide variety of data sources and applications. And I think what people really wanted, was to store that data before having to define exactly what structures they should go into. So over the last decade or so, the focus for most organizations is figuring out how to capture, store and frankly manage that data. And as a platform to do that, I think, Hadoop was pretty good. It certainly changed the way that a lot of enterprises think about their data and where it's locked up. In parallel with Hadoop, particularly over the last five years, Cloud Object Storage has also given every organization another option for collecting, storing and managing even more data. That has led to a huge growth in data storage, obviously, up on public clouds like Amazon and their S3, Google Cloud Storage and Azure Blob Storage just to name a few. And then when you consider regional and local object storage offered by cloud vendors all over the world, the explosion of that data, in leveraging this type of object storage is very real. And I think, as I mentioned, it's just part of this broader storage disruption that's been going on. But with all this growth in the data, in all these new places to put this data, every organization we talk to is facing even more challenges now around the data silo. Sure the data silos certainly getting bigger. And hopefully they're getting cheaper per bit. But as I said, the focus has really been on collecting, storing and managing the data. But between the new data lakes and many different cloud object storage combined with all sorts of data types from the complexity of managing all this, getting that business value has been very limited. This actually takes me to big bet number one for Team Vertica, which is to unify the data. Our goal, and some of the announcements we have made today plus roadmap announcements I'll share with you throughout this presentation. Our goal is to ensure that all the time, money and effort that has gone into storing that data, all the data turns into business value. So how are we going to do that? 
With a unified analytics platform that analyzes the data wherever it is HDFS, Cloud Object Storage, External tables in an any format ORC, Parquet, JSON, and of course, our own Native Roth Vertica format. Analyze the data in the right place in the right format, using a single unified tool. This is something that Vertica has always been committed to, and you'll see in some of our announcements today, we're just doubling down on that commitment. Let's talk a little bit more about the public cloud. This is certainly the second trend. It's the second wave maybe of data disruption with object storage. And there's a lot of advantages when it comes to public cloud. There's no question that the public clouds give rapid access to compute storage with the added benefit of eliminating data center maintenance that so many companies, want to get out of themselves. But maybe the biggest advantage that I see is the architectural innovation. The public clouds have introduced so many methodologies around how to provision quickly, separating compute and storage and really dialing-in the exact needs on demand, as you change workloads. When public clouds began, it made a lot of sense for the cloud providers and their customers to charge and pay for compute and storage in the ratio that each use case demanded. And I think you're seeing that trend, proliferate all over the place, not just up in public cloud. That architecture itself is really becoming the next generation architecture for on-premise data centers, as well. But there are a lot of concerns. I think we're all aware of them. They're out there many times for different workloads, there are higher costs. Especially if some of the workloads that are being run through analytics, which tend to run all the time. Just like some of the silo challenges that companies are facing with HDFS, data lakes and cloud storage, the public clouds have similar types of siloed challenges as well. Initially, there was a belief that they were cheaper than data centers, and when you added in all the costs, it looked that way. And again, for certain elastic workloads, that is the case. I don't think that's true across the board overall. Even to the point where a lot of the cloud vendors aren't just charging lower costs anymore. We hear from a lot of customers that they don't really want to tether themselves to any one cloud because of some of those uncertainties. Of course, security and privacy are a concern. We hear a lot of concerns with regards to cloud and even some SaaS vendors around shared data catalogs, across all the customers and not enough separation. But security concerns are out there, you can read about them. I'm not going to jump into that bandwagon. But we hear about them. And then, of course, I think one of the things we hear the most from our customers, is that each cloud stack is starting to feel even a lot more locked in than the traditional data warehouse appliance. And as everybody knows, the industry has been running away from appliances as fast as it can. And so they're not eager to get locked into another, quote, unquote, virtual appliance, if you will, up in the cloud. They really want to make sure they have flexibility in which clouds, they're going to today, tomorrow and in the future. And frankly, we hear from a lot of our customers that they're very interested in eventually mixing and matching, compute from one cloud with, say storage from another cloud, which I think is something that we'll hear a lot more about. 
And so for us, that's why we've got our big bet number two. we love the cloud. We love the public cloud. We love the private clouds on-premise, and other hosting providers. But our passion and commitment is for Vertica to be able to run in any of the clouds that our customers choose, and make it portable across those clouds. We have supported on-premises and all public clouds for years. And today, we have announced even more support for Vertica in Eon Mode, the deployment option that leverages the separation of compute from storage, with even more deployment choices, which I'm going to also touch more on as we go. So super excited about our big bet number two. And finally as I mentioned, for all the hype that there is around machine learning, I actually think that most importantly, this third trend that team Vertica is determined to address is the need to bring business critical, analytics, machine learning, data science projects into production. For so many years, there just wasn't enough data available to justify the investment in machine learning. Also, processing power was expensive, and storage was prohibitively expensive. But to train and score and evaluate all the different models to unlock the full power of predictive analytics was tough. Today you have those massive data volumes. You have the relatively cheap processing power and storage to make that dream a reality. And if you think about this, I mean with all the data that's available to every company, the real need is to operationalize the speed and the scale of machine learning so that these organizations can actually take advantage of it where they need to. I mean, we've seen this for years with Vertica, going back to some of the most advanced gaming companies in the early days, they were incorporating this with live data directly into their gaming experiences. Well, every organization wants to do that now. And the accuracy for clickability and real time actions are all key to separating the leaders from the rest of the pack in every industry when it comes to machine learning. But if you look at a lot of these projects, the reality is that there's a ton of buzz, there's a ton of hype spanning every acronym that you can imagine. But most companies are struggling, do the separate teams, different tools, silos and the limitation that many platforms are facing, driving, down sampling to get a small subset of the data, to try to create a model that then doesn't apply, or compromising accuracy and making it virtually impossible to replicate models, and understand decisions. And if there's one thing that we've learned when it comes to data, prescriptive data at the atomic level, being able to show end of one as we refer to it, meaning individually tailored data. No matter what it is healthcare, entertainment experiences, like gaming or other, being able to get at the granular data and make these decisions, make that scoring applies to machine learning just as much as it applies to giving somebody a next-best-offer. But the opportunity has never been greater. The need to integrate this end-to-end workflow and support the right tools without compromising on that accuracy. Think about it as no downsampling, using all the data, it really is key to machine learning success. Which should be no surprise then why the third big bet from Vertica is one that we've actually been working on for years. And we're so proud to be where we are today, helping the data disruptors across the world operationalize machine learning. 
This big bet has the potential to truly unlock, really the potential of machine learning. And today, we're announcing some very important new capabilities specifically focused on unifying the work being done by the data science community, with their preferred tools and platforms, and the volume of data and performance at scale, available in Vertica. Our strategy has been very consistent over the last several years. As I said in the beginning, we haven't deviated from our strategy. Of course, there's always things that we add. Most of the time, it's customer driven, it's based on what our customers are asking us to do. But I think we've also done a great job, not trying to be all things to all people. Especially as these hype cycles flare up around us, we absolutely love participating in these different areas without getting completely distracted. I mean, there's a variety of query tools and data warehouses and analytics platforms in the market. We all know that. There are tools and platforms that are offered by the public cloud vendors, by other vendors that support one or two specific clouds. There are appliance vendors, who I was referring to earlier who can deliver package data warehouse offerings for private data centers. And there's a ton of popular machine learning tools, languages and other kits. But Vertica is the only advanced analytic platform that can do all this, that can bring it together. We can analyze the data wherever it is, in HDFS, S3 Object Storage, or Vertica itself. Natively we support multiple clouds on-premise deployments, And maybe most importantly, we offer that choice of deployment modes to allow our customers to choose the architecture that works for them right now. It still also gives them the option to change move, evolve over time. And Vertica is the only analytics database with end-to-end machine learning that can truly operationalize ML at scale. And I know it's a mouthful. But it is not easy to do all these things. It is one of the things that highly differentiates Vertica from the rest of the pack. It is also why our customers, all of you continue to bet on us and see the value that we are delivering and we will continue to deliver. Here's a couple of examples of some of our customers who are powered by Vertica. It's the scale of data. It's the millisecond response times. Performance and scale have always been a huge part of what we have been about, not the only thing. I think the functionality all the capabilities that we add to the platform, the ease of use, the flexibility, obviously with the deployment. But if you look at some of the numbers they are under these customers on this slide. And I've shared a lot of different stories about these customers. Which, by the way, it still amaze me every time I talk to one and I get the updates, you can see the power and the difference that Vertica is making. Equally important, if you look at a lot of these customers, they are the epitome of being able to deploy Vertica in a lot of different environments. Many of the customers on this slide are not using Vertica just on-premise or just in the cloud. They're using it in a hybrid way. They're using it in multiple different clouds. And again, we've been with them on that journey throughout, which is what has made this product and frankly, our roadmap and our vision exactly what it is. It's been quite a journey. And that journey continues now with the Vertica 10 release. The Vertica 10 release is obviously a massive release for us. 
But if you look back, you can see that building on that native columnar architecture that started a long time ago, obviously, with the C-Store paper. We built it to leverage that commodity hardware, because it was an architecture that was never tightly integrated with any specific underlying infrastructure. I still remember hearing the initial pitch from Mike Stonebreaker, about the vision of Vertica as a software only solution and the importance of separating the company from hardware innovation. And at the time, Mike basically said to me, "there's so much R&D in innovation that's going to happen in hardware, we shouldn't bake hardware into our solution. We should do it in software, and we'll be able to take advantage of that hardware." And that is exactly what has happened. But one of the most recent innovations that we embraced with hardware is certainly that separation of compute and storage. As I said previously, the public cloud providers offered this next generation architecture, really to ensure that they can provide the customers exactly what they needed, more compute or more storage and charge for each, respectively. The separation of compute and storage, compute from storage is a major milestone in data center architectures. If you think about it, it's really not only a public cloud innovation, though. It fundamentally redefines the next generation data architecture for on-premise and for pretty much every way people are thinking about computing today. And that goes for software too. Object storage is an example of the cost effective means for storing data. And even more importantly, separating compute from storage for analytic workloads has a lot of advantages. Including the opportunity to manage much more dynamic, flexible workloads. And more importantly, truly isolate those workloads from others. And by the way, once you start having something that can truly isolate workloads, then you can have the conversations around autonomic computing, around setting up some nodes, some compute resources on the data that won't affect any of the other data to do some things on their own, maybe some self analytics, by the system, etc. A lot of things that many of you know we've already been exploring in terms of our own system data in the product. But it was May 2018, believe it or not, it seems like a long time ago where we first announced Eon Mode and I want to make something very clear, actually about Eon mode. It's a mode, it's a deployment option for Vertica customers. And I think this is another huge benefit that we don't talk about enough. But unlike a lot of vendors in the market who will dig you and charge you for every single add-on like hit-buy, you name it. You get this with the Vertica product. If you continue to pay support and maintenance, this comes with the upgrade. This comes as part of the new release. So any customer who owns or buys Vertica has the ability to set up either an Enterprise Mode or Eon Mode, which is a question I know that comes up sometimes. Our first announcement of Eon was obviously AWS customers, including the trade desk, AT&T. Most of whom will be speaking here later at the Virtual Big Data Conference. They saw a huge opportunity. Eon Mode, not only allowed Vertica to scale elastically with that specific compute and storage that was needed, but it really dramatically simplified database operations including things like workload balancing, node recovery, compute provisioning, etc. 
So one of the most popular functions is that ability to isolate the workloads and really allocate those resources without negatively affecting others. And even though traditional data warehouses, including Vertica Enterprise Mode have been able to do lots of different workload isolation, it's never been as strong as Eon Mode. Well, it certainly didn't take long for our customers to see that value across the board with Eon Mode. Not just up in the cloud, in partnership with one of our most valued partners and a platinum sponsor here. Joy mentioned at the beginning. We announced Vertica Eon Mode for Pure Storage FlashBlade in September 2019. And again, just to be clear, this is not a new product, it's one Vertica with yet more deployment options. With Pure Storage, Vertica in Eon mode is not limited in any way by variable cloud, network latency. The performance is actually amazing when you take the benefits of separate and compute from storage and you run it with a Pure environment on-premise. Vertica in Eon Mode has a super smart cache layer that we call the depot. It's a big part of our secret sauce around Eon mode. And combined with the power and performance of Pure's FlashBlade, Vertica became the industry's first advanced analytics platform that actually separates compute and storage for on-premises data centers. Something that a lot of our customers are already benefiting from, and we're super excited about it. But as I said, this is a journey. We don't stop, we're not going to stop. Our customers need the flexibility of multiple public clouds. So today with Vertica 10, we're super proud and excited to announce support for Vertica in Eon Mode on Google Cloud. This gives our customers the ability to use their Vertica licenses on Amazon AWS, on-premise with Pure Storage and on Google Cloud. Now, we were talking about HDFS and a lot of our customers who have invested quite a bit in HDFS as a place, especially to store data have been pushing us to support Eon Mode with HDFS. So as part of Vertica 10, we are also announcing support for Vertica in Eon Mode using HDFS as the communal storage. Vertica's own Roth format data can be stored in HDFS, and actually the full functionality of Vertica is complete analytics, geospatial pattern matching, time series, machine learning, everything that we have in there can be applied to this data. And on the same HDFS nodes, Vertica can actually also analyze data in ORC or Parquet format, using External tables. We can also execute joins between the Roth data the External table holds, which powers a much more comprehensive view. So again, it's that flexibility to be able to support our customers, wherever they need us to support them on whatever platform, they have. Vertica 10 gives us a lot more ways that we can deploy Eon Mode in various environments for our customers. It allows them to take advantage of Vertica in Eon Mode and the power that it brings with that separation, with that workload isolation, to whichever platform they are most comfortable with. Now, there's a lot that has come in Vertica 10. I'm definitely not going to be able to cover everything. But we also introduced complex types as an example. And complex data types fit very well into Eon as well in this separation. They significantly reduce the data pipeline, the cost of moving data between those, a much better support for unstructured data, which a lot of our customers have mixed with structured data, of course, and they leverage a lot of columnar execution that Vertica provides. 
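As a rough sketch of what that looks like in practice, here is a hypothetical external table defined over Parquet files in HDFS, joined against a native Vertica table. The paths, columns, and table names are illustrative only:

```sql
-- External table over Parquet files already sitting in HDFS.
CREATE EXTERNAL TABLE clicks_ext (
    user_id INT,
    url     VARCHAR(2048),
    ts      TIMESTAMP
) AS COPY FROM 'hdfs:///data/clickstream/*.parquet' PARQUET;

-- Join the external data with a native Vertica table in a single query.
SELECT c.user_id, a.account_tier, COUNT(*) AS clicks
FROM clicks_ext c
JOIN accounts a ON a.user_id = c.user_id
GROUP BY c.user_id, a.account_tier;
```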
So you get complex data types in Vertica now, a lot more data, stronger performance. It goes great with the announcement that we made with the broader Eon Mode. Let's talk a little bit more about machine learning. We've been actually doing work in and around machine learning with various extra regressions and a whole bunch of other algorithms for several years. We saw the huge advantage that MPP offered, not just as a sequel engine as a database, but for ML as well. Didn't take as long to realize that there's a lot more to operationalizing machine learning than just those algorithms. It's data preparation, it's that model trade training. It's the scoring, the shaping, the evaluation. That is so much of what machine learning and frankly, data science is about. You do know, everybody always wants to jump to the sexy algorithm and we handle those tasks very, very well. It makes Vertica a terrific platform to do that. A lot of work in data science and machine learning is done in other tools. I had mentioned that there's just so many tools out there. We want people to be able to take advantage of all that. We never believed we were going to be the best algorithm company or come up with the best models for people to use. So with Vertica 10, we support PMML. We can import now and export PMML models. It's a huge step for us around that operationalizing machine learning projects for our customers. Allowing the models to get built outside of Vertica yet be imported in and then applying to that full scale of data with all the performance that you would expect from Vertica. We also are more tightly integrating with Python. As many of you know, we've been doing a lot of open source projects with the community driven by many of our customers, like Uber. And so now with Python we've integrated with TensorFlow, allowing data scientists to build models in their preferred language, to take advantage of TensorFlow. But again, to store and deploy those models at scale with Vertica. I think both these announcements are proof of our big bet number three, and really our commitment to supporting innovation throughout the community by operationalizing ML with that accuracy, performance and scale of Vertica for our customers. Again, there's a lot of steps when it comes to the workflow of machine learning. These are some of them that you can see on the slide, and it's definitely not linear either. We see this as a circle. And companies that do it, well just continue to learn, they continue to rescore, they continue to redeploy and they want to operationalize all that within a single platform that can take advantage of all those capabilities. And that is the platform, with a very robust ecosystem that Vertica has always been committed to as an organization and will continue to be. This graphic, many of you have seen it evolve over the years. Frankly, if we put everything and everyone on here wouldn't fit on a slide. But it will absolutely continue to evolve and grow as we support our customers, where they need the support most. So, again, being able to deploy everywhere, being able to take advantage of Vertica, not just as a business analyst or a business user, but as a data scientists or as an operational or BI person. We want Vertica to be leveraged and used by the broader organization. So I think it's fair to say and I encourage everybody to learn more about Vertica 10, because I'm just highlighting some of the bigger aspects of it. But we talked about those three market trends. 
The need to unify the silos, the need for hybrid multiple cloud deployment options, the need to operationalize business critical machine learning projects. Vertica 10 has absolutely delivered on those. But again, we are not going to stop. It is our job not to, and this is how Team Vertica thrives. I always joke that the next release is the best release. And, of course, even after Vertica 10, that is also true, although Vertica 10 is pretty awesome. But, you know, from the first line of code, we've always been focused on performance and scale, right. And like any really strong data platform, the execution engine, the optimizer and the execution engine are the two core pieces of that. Beyond Vertica 10, some of the big things that we're already working on, next generation execution engine. We're already actually seeing incredible early performance from this. And this is just one example, of how important it is for an organization like Vertica to constantly go back and re-innovate. Every single release, we do the sit ups and crunches, our performance and scale. How do we improve? And there's so many parts of the core server, there's so many parts of our broader ecosystem. We are constantly looking at coverages of how we can go back to all the code lines that we have, and make them better in the current environment. And it's not an easy thing to do when you're doing that, and you're also expanding in the environment that we are expanding into to take advantage of the different deployments, which is a great segue to this slide. Because if you think about today, we're obviously already available with Eon Mode and Amazon, AWS and Pure and actually MinIO as well. As I talked about in Vertica 10 we're adding Google and HDFS. And coming next, obviously, Microsoft Azure, Alibaba cloud. So being able to expand into more of these environments is really important for the Vertica team and how we go forward. And it's not just running in these clouds, for us, we want it to be a SaaS like experience in all these clouds. We want you to be able to deploy Vertica in 15 minutes or less on these clouds. You can also consume Vertica, in a lot of different ways, on these clouds. As an example, in Amazon Vertica by the Hour. So for us, it's not just about running, it's about taking advantage of the ecosystems that all these cloud providers offer, and really optimizing the Vertica experience as part of them. Optimization, around automation, around self service capabilities, extending our management console, we now have products that like the Vertica Advisor Tool that our Customer Success Team has created to actually use our own smarts in Vertica. To take data from customers that give it to us and help them tune automatically their environment. You can imagine that we're taking that to the next level, in a lot of different endeavors that we're doing around how Vertica as a product can actually be smarter because we all know that simplicity is key. There just aren't enough people in the world who are good at managing data and taking it to the next level. And of course, other things that we all hear about, whether it's Kubernetes and containerization. You can imagine that that probably works very well with the Eon Mode and separating compute and storage. But innovation happens everywhere. We innovate around our community documentation. Many of you have taken advantage of the Vertica Academy. The numbers there are through the roof in terms of the number of people coming in and certifying on it. 
So there's a lot of things that are within the core products. There's a lot of activity and action beyond the core products that we're taking advantage of. And let's not forget why we're here, right? It's easy to talk about a platform, a data platform, it's easy to jump into all the functionality, the analytics, the flexibility, how we can offer it. But at the end of the day, somebody, a person, she's got to take advantage of this data, she's got to be able to take this data and use this information to make a critical business decision. And that doesn't happen unless we explore lots of different and frankly, new ways to get that predictive analytics UI and interface beyond just the standard BI tools in front of her at the right time. And so there's a lot of activity, I'll tease you with that going on in this organization right now about how we can do that and deliver that for our customers. We're in a great position to be able to see exactly how this data is consumed and used and start with this core platform that we have to go out. Look, I know, the plan wasn't to do this as a virtual BDC. But I really appreciate you tuning in. Really appreciate your support. I think if there's any silver lining to us, maybe not being able to do this in person, it's the fact that the reach has actually gone significantly higher than what we would have been able to do in person in Boston. We're certainly looking forward to doing a Big Data Conference in the future. But if I could leave you with anything, know this, since that first release for Vertica, and our very first customers, we have been very consistent. We respect all the innovation around us, whether it's open source or not. We understand the market trends. We embrace those new ideas and technologies and for us true north, and the most important thing is what does our customer need to do? What problem are they trying to solve? And how do we use the advantages that we have without disrupting our customers? But knowing that you depend on us to deliver that unified analytics strategy, it will deliver that performance of scale, not only today, but tomorrow and for years to come. We've added a lot of great features to Vertica. I think we've said no to a lot of things, frankly, that we just knew we wouldn't be the best company to deliver. When we say we're going to do things we do them. Vertica 10 is a perfect example of so many of those things that we from you, our customers have heard loud and clear, and we have delivered. I am incredibly proud of this team across the board. I think the culture of Vertica, a customer first culture, jumping in to help our customers win no matter what is also something that sets us massively apart. I hear horror stories about support experiences with other organizations. And people always seem to be amazed at Team Vertica's willingness to jump in or their aptitude for certain technical capabilities or understanding the business. And I think sometimes we take that for granted. But that is the team that we have as Team Vertica. We are incredibly excited about Vertica 10. I think you're going to love the Virtual Big Data Conference this year. I encourage you to tune in. Maybe one other benefit is I know some people were worried about not being able to see different sessions because they were going to overlap with each other well now, even if you can't do it live, you'll be able to do those sessions on demand. Please enjoy the Vertica Big Data Conference here in 2020. 
Please, you and your families and your co-workers, be safe during these times. I know we will get through it. And analytics is probably going to help with a lot of that, and we already know it is helping in many different ways. So believe in the data, believe in data's ability to change the world for the better. And thank you for your time. And with that, I am delighted to now introduce Micro Focus CEO Stephen Murdoch to the Vertica Big Data Virtual Conference. Thank you Stephen. >> Stephen: Hi, everyone, my name is Stephen Murdoch. I have the pleasure and privilege of being the Chief Executive Officer here at Micro Focus. Please let me add my welcome to the Big Data Conference, and also my thanks for your support, as we've had to pivot to this being virtual rather than a physical conference. It's amazing how quickly we all reset to a new normal. I certainly didn't expect to be addressing you from my study. Vertica is an incredibly important part of the Micro Focus family. It is key to our goal of trying to enable and help customers become much more data driven across all of their IT operations. Vertica 10 is a huge step forward, we believe. It allows for multi-cloud innovation and genuinely hybrid deployments, lets you begin to leverage machine learning properly in the enterprise, and also opens up the opportunity to unify currently siloed lakes of information. We operate in a very noisy, very competitive market, and there are people in that market who can do some of those things. The reason we are so excited about Vertica is we genuinely believe that we are the best at doing all of those things. And that's why we've announced publicly, and are executing internally, incremental investment into Vertica. That investment is targeted at accelerating the roadmaps that already exist, and getting that innovation into your hands faster. The idea is that speed is key. It's not a question of if companies have to become data driven organizations, it's a question of when. So that speed now is really important. And that's why we believe that the Big Data Conference gives a great opportunity for you to accelerate your own plans. You will have the opportunity to talk to some of our best architects, some of the best development brains that we have. But more importantly, you'll also get to hear from some of our phenomenal Vertica customers. You'll hear from Uber, from the Trade Desk, from Philips, and from AT&T, as well as many, many others. And just hearing how those customers are using the power of Vertica to accelerate their own plans is, I think, the highlight. And I encourage you to use this opportunity to its fullest. Let me close by again saying thank you. We genuinely hope that you get as much from this virtual conference as you could have from a physical conference. And we look forward to your engagement, and we look forward to hearing your feedback. With that, thank you very much. >> Joy: Thank you so much, Stephen, for joining us for the Vertica Big Data Conference. Your support and enthusiasm for Vertica is so clear, and it makes a big difference. Now, I'm delighted to introduce Amy Fowler, the VP of Strategy and Solutions for FlashBlade at Pure Storage, which is one of our BDC Platinum Sponsors and one of our most valued partners. It was a proud moment for me when we announced Vertica in Eon Mode for Pure Storage FlashBlade, and we became the first analytics data warehouse that separates compute from storage for on-premise data centers. Thank you so much, Amy, for joining us. Let's get started.
>> Amy: Well, thank you, Joy, so much for having us. And thank you all for joining us today, virtually, as we may all be. So, as we just heard from Colin Mahony, there are some really interesting trends happening right now in the big data analytics market. From the end of the Hadoop hype cycle, to the new cloud reality, and even the opportunity to help the many data science and machine learning projects move from labs to production. So let's talk about these trends in the context of infrastructure, and in particular, look at why a modern storage platform is relevant as organizations take on the challenges and opportunities associated with these trends. The answer is that the Hadoop hype cycle left a lot of data in HDFS data lakes, or reservoirs, or swamps, depending upon the level of data hygiene, but without the ability to get the value that was promised from Hadoop as a platform rather than a distributed file store. And when we combine that data with the massive volume of data in Cloud Object Storage, we find ourselves with a lot of data and a lot of silos, but without a way to unify that data and find value in it. Now when you look at the infrastructure data lakes are traditionally built on, it is often direct attached storage, or DAS. The approach that Hadoop took when it entered the market was primarily bound by the limits of networking and storage technologies: one gig ethernet and slower spinning disk. But today, those barriers do not exist. All-flash storage has fundamentally transformed how data is accessed, managed and leveraged. The need for local data storage for significant volumes of data has been largely mitigated by the performance increases afforded by all flash. At the same time, organizations can achieve superior economies of scale with that segregation of compute and storage. Compute and storage don't always scale in lockstep. Would you want to add an engine to the train every time you add another boxcar? Probably not. But from a Pure Storage perspective, FlashBlade is uniquely architected to allow customers to achieve better resource utilization for compute and storage, while at the same time reducing the complexity that has arisen from the siloed nature of the original big data solutions. The second and equally important recent trend we see is something I'll call cloud reality. The public clouds made a lot of promises, and some of those promises were delivered. But cloud economics, especially usage-based pricing and elastic scaling without the control that many companies need to manage the financial impact, are causing a lot of issues. In addition, the risk of vendor lock-in, from data egress charges to integrated software stacks that can't be moved or deployed on-premise, is causing a lot of organizations to back off an all-the-way cloud strategy and move toward hybrid deployments. Which is kind of funny in a way, because it wasn't that long ago that there was a lot of talk about no more data centers. And for example, one large retailer, I won't name them, but I'll admit they are my favorites: several years ago they told us they were completely done with on-prem storage infrastructure, because they were going 100% to the cloud. But they just deployed FlashBlade for their data pipelines, because they need predictable performance at scale, and the all-cloud TCO just didn't add up. Now, that being said, while there are certainly challenges with the public cloud, it has also brought some things to the table that we see most organizations wanting.
First of all, in a lot of cases applications have been built to leverage object storage platforms like S3. So they need that object protocol, but they may also need it to be fast. "Fast object" may have sounded like an oxymoron only a few years ago, and this is an area of the market where Pure and FlashBlade have really taken a leadership position. Second, regardless of where the data is physically stored, organizations want the best elements of a cloud experience. And for us, that means two main things. Number one is simplicity and ease of use. If you need a bunch of storage experts to run the system, that should be considered a bug. The other big one is the consumption model: the ability to pay for what you need when you need it, and seamlessly grow your environment over time, totally non-disruptively. This is actually pretty huge, and something that a lot of vendors try to solve for with finance programs. But no finance program can address the pain of a forklift upgrade when you need to move to next-gen hardware. To scale non-disruptively over long periods of time, five to ten years plus, crucial architectural decisions need to be made at the outset. Plus, you need the ability to pay as you use it, and we offer something for FlashBlade called Pure as a Service, which delivers exactly that. The third cloud characteristic that many organizations want is the option for hybrid, even if that is just a DR site in the cloud. In our case, that means supporting replication to S3 at AWS. And the final trend, which to me represents the biggest opportunity for all of us, is the need to help the many data science and machine learning projects move from labs to production. This means bringing all the machine learning functions and model training to the data, rather than moving samples or segments of data to separate platforms. As we all know, machine learning needs a ton of data for accuracy, and there is just too much data to retrieve from the cloud for every training job. At the same time, predictive analytics without accuracy is not going to deliver the business advantage that everyone is seeking. You can visualize data analytics, as it is traditionally deployed, as being on a continuum, with the thing we've been doing the longest, data warehousing, on one end, and AI on the other end. But the way this manifests in most environments is a series of silos that get built up, so data is duplicated across all kinds of bespoke analytics and AI environments and infrastructure. This creates an expensive and complex environment. Historically, there was no other way to do it, because some level of performance is always table stakes, and each of these parts of the data pipeline has a different workload profile. A single platform to deliver on the multidimensional performance this diverse set of applications requires didn't exist three years ago. And that's why the application vendors pointed you towards bespoke things like the DAS environments we talked about earlier. The fact that better options exist today is why we're seeing them move towards supporting this disaggregation of compute and storage. And when it comes to a platform that is a better option, one with a modern architecture that can address the diverse performance requirements of this continuum and allow organizations to bring a model to the data instead of creating separate silos, that's exactly what FlashBlade is built for: small files, large files, high throughput, low latency, and scale to petabytes in a single namespace.
And delivering this, importantly, in a single namespace, is what we're focused on for our customers. At Pure, we talk about it in the context of the modern data experience, because at the end of the day, that's what it's really all about: the experience for your teams and your organization. And together, Pure Storage and Vertica have delivered that experience to a wide range of customers. From a SaaS analytics company, which uses Vertica on FlashBlade to authenticate the quality of digital media in real time, to a multinational car company, which uses Vertica on FlashBlade to make thousands of decisions per second for autonomous cars, or a healthcare organization, which uses Vertica on FlashBlade to enable healthcare providers to make real-time decisions that impact lives. And I'm sure you're all looking forward to hearing from John Yovanovich from AT&T, to hear how he's been doing this with Vertica and FlashBlade as well. He's coming up soon. We have been really excited to build this partnership with Vertica, and we're proud to provide the only on-premise storage platform validated with Vertica Eon Mode, and to deliver this modern data experience to our customers together. Thank you all so much for joining us today. >> Joy: Amy, thank you so much for your time and your insights. Modern infrastructure is key to modern analytics, especially as organizations leverage next-generation data center architectures and object storage for their on-premise data centers. Now, I'm delighted to introduce our last speaker in our Vertica Big Data Conference keynote, John Yovanovich, Director of IT for AT&T. Vertica is so proud to serve AT&T, and especially proud of the harmonious impact we are having in partnership with Pure Storage. John, welcome to the virtual Vertica BDC. >> John: Thank you, Joy. It's a pleasure to be here, and I'm excited to go through this presentation today, in a unique fashion. As I was thinking through how I wanted to present the partnership that we have formed between Pure Storage, Vertica and AT&T, I wanted to emphasize how well we all work together, and how these three components have really driven home my desire for a harmonious, to use your word, relationship. So I'm going to move forward here. The theme of today's presentation is the Pure Vertica Symphony, live at AT&T. And if anybody is a Westworld fan, you can appreciate the sheet music on the right-hand side. What I'm going to highlight here, in a musical fashion, is how we at AT&T leverage these technologies to save money, to deliver a more efficient platform, and to make our customers happier overall. So as we look back, as early as just a few years ago here at AT&T, I realized that we had many musicians to help the company. You might want to call them data scientists or data analysts; for the theme, we'll stay with musicians. None of them were singing or playing from the same hymn book or sheet music. And so what we had was many organizations chasing a similar dream, but not exactly the same dream. The best way to describe that, and I think this might resonate with a lot of people in your organizations, is this: how many organizations are chasing a customer 360 view in your company? Well, I can tell you that I have at least four in my company, and I'm sure there are many that I don't know of. That is our problem, because what we see is a repetitive sourcing of data. We see a repetitive copying of data.
And there's just so much money being spent. This is where I asked Pure Storage and Vertica to help me solve that problem with their technologies. What I also noticed was that there was no coordination between these departments. In fact, if you look here, nobody really wants to play with finance. Sales, marketing and care, sure, they all copied each other's data, but they didn't actually communicate with each other as they were copying the data. So the data became replicated and out of sync. This is a challenge throughout, not just my company, but all companies across the world. And that is, the more we replicate the data, the more problems we have at chasing or conquering the goal of a single version of truth. In fact, I kid that at AT&T we have actually adopted the multiple-versions-of-truth theory, which is not where we want to be, but this is where we are. But we are conquering that with the synergies between Pure Storage and Vertica. This is what it leaves us with, and this is where we are challenged: each one of our siloed business units had their own storage, their own dedicated storage, and some of them had more money than others, so they bought more storage. Some of them anticipated storing more data, and then they really did. Others are running out of space, but can't buy any more because their budgets aren't being replenished. So if you look at it from this side view here, we have a limited amount of compute, or fixed compute, dedicated to each one of these silos. And that's because of the wanting to own your own. And the other part is that you are limited, or wasting space, depending on where you are in the organization. So the synergies aren't just about the data, but also the compute and the storage, and I wanted to tackle that challenge as well. So I was tackling the data, I was tackling the storage, and I was tackling the compute, all at the same time. My ask across the company was: can we just please play together, okay? And to do that, I knew that I wasn't going to tackle this by getting everybody in the same room and getting them to agree that we needed one account table, because they would argue about whose account table is the best account table. But I knew that if I brought the account tables together, they would soon see that they had so much redundancy that I could now start retiring data sources. I also knew that if I brought all the compute together, they would all be happy, but I didn't want them tackling each other. In fact, that was one of the things that all business units really enjoy: they enjoy the silo of having their own compute, and more or less being able to control their own destiny. Well, Vertica's subclustering allows just that, and this is exactly what I was hoping for, and I'm glad they brought it through. And finally, how did I solve the problem of the single account table? Well, you can when you don't have dedicated storage, and you can separate compute and storage, as Vertica in Eon Mode does. And we store the data on FlashBlades, which you see on the left and right-hand sides of our container, which I can describe in a moment. Okay, so what we have here is a container full of compute, with all the Vertica nodes sitting in the middle, and two loader subclusters, we'll call them, sitting on the sides, which are dedicated to just putting data onto the FlashBlades, which are sitting on both ends of the container.
Now today, I have two dedicated storage racks, one on the left and one on the right, and I treat them as separate storage racks. They could be one, but I created them separately for disaster recovery purposes, in case one rack were to go down. That being said, there's no reason why I won't add a couple more here in the future, so I can have, say, a five to ten petabyte storage setup, and I'll have my DR in another container, because the DR shouldn't be in the same container. So I got them all together, I leveraged subclustering, I leveraged the separation of storage and compute. I was able to convince many of my clients that they didn't need their own account table, that they were better off having one. I reduced latency, and I reduced our data quality issues, AKA ticketing. I was able to leverage elasticity within this cluster. As you can see, there are racks and racks of compute. We set up what we'll call the fixed capacity that each of the business units needed, and then I'm able to ramp up and release the compute that's necessary for each one of my clients based on their workloads throughout the day. And while some of the compute has, more or less, dedicated itself to particular workloads, the rest is free for anybody to use. So in essence, what I have is a concert hall with a lot of seats available. If I want to run a 10-chair symphony or an 80-chair symphony, I'm able to do that. And all the while, I can also do the same with my loader nodes: I can expand my loader nodes to have their own symphony, all to themselves, and not compete with the workloads of the other clusters. What does that change for our organization? Well, it really changes the way our database administrators do their jobs. This has been a big transformation for them. They have actually become data conductors. Maybe you might even call them composers, which is interesting, because what I've asked them to do is morph into less technology and more workload analysis. And in doing so, we're able to write auto-detect scripts that watch the queues and watch the workloads, so that we can help ramp up and trim down the cluster and subclusters as necessary. It has been an exciting transformation for our DBAs, who I now need to classify as something maybe like DCAs. I don't know, I have to work with HR on that, but I think it's an exciting future for their careers. And if we bring it all together, our clusters start looking like this, where everything is moving harmoniously, we have lots of seats open for extra musicians, and we are able to emulate a cloud experience on-prem. And so, I want you to sit back and enjoy the Pure Vertica Symphony, live at AT&T. (soft music) >> Joy: Thank you so much, John, for an informative and very creative look at the benefits that AT&T is getting from its Pure Vertica symphony. I do really like the idea of engaging HR to change the title to Data Conductor. That's fantastic. I've always believed that music brings people together, and now it's clear that analytics at AT&T is part of that musical advantage. So, now it's time for a short break, and we'll be back for our breakout sessions, beginning at 12 pm Eastern Daylight Time.
We have some really exciting sessions planned later today, and then again, as you can see, on Wednesday. Now, because all of you are already logged in and listening to this keynote, you already know the steps to continue to participate in the sessions listed here and on the previous slide. In addition, everyone received an email yesterday and today, and you'll get another one tomorrow, outlining the simple steps to register, log in and choose your sessions. If you have any questions, check out the emails or go to www.vertica.com/bdc2020 for the logistics information. There are a lot of choices, and that's always a good thing. Don't worry if you want to attend more than one session, or can't listen to the live sessions due to your timezone. All the sessions, including the Q&A sections, will be available on demand, and everyone will have access to the recordings, as well as even more pre-recorded sessions that we'll post to the BDC website. Now I do want to leave you with two other important sites. First, our Vertica Academy. Vertica Academy is available to everyone, and there's a variety of very technical, self-paced, on-demand training, virtual instructor-led workshops, and Vertica Essentials Certification. And it's all free, because we believe that Vertica expertise helps everyone accelerate their Vertica projects and the advantage that those projects deliver. Now, if you have questions or want to engage with our Vertica engineering team, we're waiting for you on the Vertica forum. We'll answer any questions or discuss any ideas that you might have. Thank you again for joining the Vertica Big Data Conference keynote session. Enjoy the rest of the BDC, because there's a lot more to come.
SUMMARY :
And he'll share the exciting news And that is the platform, with a very robust ecosystem some of the best development brains that we have. the VP of Strategy and Solutions is causing a lot of organizations to back off the and especially proud of the harmonious impact And that is, the more we replicate the data, Enjoy the rest of the BDC because there's a lot more to come
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Stephen | PERSON | 0.99+ |
Amy Fowler | PERSON | 0.99+ |
Mike | PERSON | 0.99+ |
John Yavanovich | PERSON | 0.99+ |
Amy | PERSON | 0.99+ |
Colin Mahony | PERSON | 0.99+ |
AT&T | ORGANIZATION | 0.99+ |
Boston | LOCATION | 0.99+ |
John Yovanovich | PERSON | 0.99+ |
Vertica | ORGANIZATION | 0.99+ |
Joy King | PERSON | 0.99+ |
Mike Stonebreaker | PERSON | 0.99+ |
John | PERSON | 0.99+ |
May 2018 | DATE | 0.99+ |
100% | QUANTITY | 0.99+ |
Wednesday | DATE | 0.99+ |
Colin | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Vertica Academy | ORGANIZATION | 0.99+ |
five | QUANTITY | 0.99+ |
Joy | PERSON | 0.99+ |
2020 | DATE | 0.99+ |
two | QUANTITY | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
Stephen Murdoch | PERSON | 0.99+ |
Vertica 10 | TITLE | 0.99+ |
Pure Storage | ORGANIZATION | 0.99+ |
one | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
Philips | ORGANIZATION | 0.99+ |
tomorrow | DATE | 0.99+ |
AT&T. | ORGANIZATION | 0.99+ |
September 2019 | DATE | 0.99+ |
Python | TITLE | 0.99+ |
www.vertica.com/bdc2020 | OTHER | 0.99+ |
One gig | QUANTITY | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Second | QUANTITY | 0.99+ |
First | QUANTITY | 0.99+ |
15 minutes | QUANTITY | 0.99+ |
yesterday | DATE | 0.99+ |
UNLIST TILL 4/1 - Putting Complex Data Types to Work
Hello everybody, thank you for joining us today for the virtual Vertica BDC 2020. Today's breakout session is entitled "Putting Complex Data Types to Work." I'm Jeff Healey, I lead Vertica marketing, and I'll be your host for this breakout session. Joining me is Deepak Majeti, technical lead from Vertica engineering. Before we begin, I encourage you to submit questions and comments during the virtual session. You don't have to wait; just type your question or comment in the question box below the slides and click Submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions we don't address, we'll do our best to answer offline. Alternatively, visit the Vertica forum at forum.vertica.com to post your questions there after the session; the engineering team is planning to join the forum conversation. Also, as a reminder, you can maximize your screen by clicking the double-arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand this week; we'll send you a notification as soon as it's ready. Now let's get started. Over to you, Deepak. Thanks, Jeff. I work on complex data types in Vertica R&D, so without further delay, let's see why and how we should put complex data types to work in your data analytics. Here is the outline, or overview, of my talk today. First, I'm going to talk about what complex data types are, and some use cases. I will then quickly cover some file formats that support these complex data types. I will then deep-dive into the current support for complex data types in Vertica. Finally, I'll conclude with some usage considerations, what is coming in our 10.0 release, and our future roadmap and directions for this project. So, what are complex data types? Complex data types are nested data structures composed of primitive types. Primitive types are nothing but your int, float, string, varbinary, etc., the basic types. Some examples of complex data types include struct (also called row), array, list, set, map, and union. Composite types can also be built by composing other complex types. Composite types are very useful for handling sparse data, and we'll see some examples of that use case in this presentation; they also help simplify analysis. So let's look at some examples of complex data types. In the first example, on the left, you can see a simple customer, which is of type struct with two fields, namely a field "name" of type string and a field "id" of type integer. Structs are nothing but a group of fields, and each field has a type of its own; the type can be primitive or another complex type. On the right we have some example data for this simple customer complex type: it's basically two fields of type string and integer, so in this case you have two rows, where the first row has name Alex with his ID, and the second row has name Mary with her ID. The second complex type on the left is phone numbers, of type array, where the element type is string. An array is nothing but a collection of elements; the elements could again be a primitive type or another complex type. In this example the collection is of type string, which is a primitive type, and on the right you have some example data for this collection, or array type, called phone numbers. Basically each row has a set, or list, or collection of phone numbers: the first row has two phone numbers, and the second has a single phone number in that array.
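To make the struct and array examples concrete, here is a minimal sketch of the same sample data expressed as plain Python records; the exact IDs and phone numbers are illustrative, not taken from the talk:

```python
# Two "customer" struct rows: a name field (string) and an id field (integer).
customers = [
    {"name": "Alex", "id": 1},
    {"name": "Mary", "id": 2},
]

# Two "phone_numbers" array rows: each row holds a collection of strings.
phone_numbers = [
    ["555-0100", "555-0101"],   # first row: two numbers
    ["555-0102"],               # second row: a single number
]

# A field of a struct can itself be another complex type, e.g. embedding
# the phone-number array inside the customer struct.
nested_customer = {"name": "Alex", "id": 1,
                   "phone_numbers": ["555-0100", "555-0101"]}
print(nested_customer["phone_numbers"][0])
```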
And the third type on the slide is the map data type. A map is nothing but a collection of key-value pairs: each element is a key and a value, and you have a collection of such elements. The key is usually a primitive type; the value, however, can be a primitive or complex type. In this example, both the key and value are of type string, and on the right side of the slide you have some sample data. Here we have HTTP requests, where the key is the header name and the value is the header value. For instance, on the first row we have a key Pragma with value no-cache and a key Host with some hostname, and on the second row you have a key Accept with a value like text/html. Because they hold a collection of elements, arrays and maps are commonly referred to as collections. So those were examples of one-level complex types; on this slide we have nested complex types. On the right we have the root complex type, called web events, of type struct. This struct has four fields: a session ID of type integer, a session duration of type timestamp, and then the third and fourth fields, customer and HTTP requests, which are further complex types themselves. Customer is again a complex type of type struct with three fields, where the first two fields, name and ID, are primitive types, but the third field is another complex type, phone numbers, which we just saw on the previous slide. Similarly, HTTP requests is the same map type we just saw. In this example, each complex type is independent, and you can reuse a complex type inside other complex types; for example, you can build another type called orders and simply reuse the customer type. However, in a practical implementation you have to deal with complexities involving security, ownership, and lifecycle dependencies. So keeping complex types independent has the advantage of reuse, but the complication is that you have to deal with security, ownership, and lifecycle dependencies. On this slide we have another style of declaring a nested complex type, called an inlined complex data type. We have the same web events struct type, but the complex types are embedded into the parent type definition: the customer and HTTP request definitions are inlined into this parent structure. The advantage is that you won't have to deal with the security and other lifecycle dependency issues, but the downside is that you can't reuse them; it's a trade-off between the two. Now let's see some use cases of these complex types. The first benefit of using complex data types is that you'll be able to express analysis more naturally. Complex types simplify the expression of analysis logic, thereby simplifying the data pipelines; in SQL, it feels as if you have tables inside tables. Let's look at an example: say you want to list all the customers with more than one thousand website events. If you have complex types, you can simply create a table called web_events with one column of the web event complex type we just saw, which has fields for the session, the customer, and the HTTP requests; you can basically have the entire schema in one type. If you don't have complex types, you'll have to create four tables, essentially one for each complex type, and then establish primary key / foreign key dependencies across these tables. Now, to achieve your goal of listing all the customers with more than a thousand web requests, if you have complex types you can simply use dot notation to extract the name and the contact, and use some special functions for maps that give you the count of HTTP requests greater than a thousand. However, if you don't have complex types, you'll have to join each table individually, extract the results from a subquery, join again in the outer query, and finally apply a predicate of total requests greater than a thousand to get your final result. So complex types simplify the query-writing part, and the execution itself is also simplified: you don't need joins. With complex types you can simply have a load step to load the map type and then apply the function on top of it directly, whereas with separate tables you have to join all the data, apply the filter step, and then do another join to get your results.
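To make the "tables inside tables" point concrete, here is a sketch of the two query styles; the table and column names are hypothetical, and the dot notation and map-count helper illustrate the idea rather than exact Vertica syntax:

```python
# With a single web_events table whose column is a nested complex type,
# the analysis reads like a simple select with dot notation.
with_complex_types = """
    SELECT e.customer.name,
           e.customer.phone_numbers
    FROM   web_events e
    WHERE  mapsize(e.http_requests) > 1000;   -- hypothetical map-count helper
"""

# Without complex types, the same question needs several normalized tables
# and a join per former field group.
without_complex_types = """
    SELECT c.name, p.phone_number
    FROM   customers c
    JOIN   phone_numbers p ON p.customer_id = c.id
    JOIN  (SELECT customer_id, COUNT(*) AS total_requests
           FROM   http_requests
           GROUP  BY customer_id) r ON r.customer_id = c.id
    WHERE  r.total_requests > 1000;
"""

print(with_complex_types)
print(without_complex_types)
```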
The other advantage of complex types is that you can process semi-structured data very efficiently. For example, if you have data from clickstreams or page views, the data is often sparse, and maps are very well suited for such data. Maps are semi-structured by nature, and with this support you can have semi-structured data represented alongside structured columns in the database. Maps have this nice property of encapsulating sparse data. As an example, the common fields of a clickstream or page-view record are Pragma, Host, and Accept. If you don't have map types, you end up creating a column for each of these header fields; if you have a map, you can simply embed all the data as key-value pairs. On the left of the slide you can see an example with a separate column for each field: you end up with a lot of nulls, because the data is sparse. If you embed the fields in a map instead, you can put them into a single column and get better efficiency and a better representation of sparse data. Imagine you have thousands of fields in a clickstream or page view: you would need thousands of columns to represent the data if you don't have a map type. So, given that these are the most commonly used complex types, let's see which file formats actually support complex data types. Most popular file formats support complex data types, but with different variations. For instance, JSON supports arrays and objects, which are complex data types, but JSON data is schema-less, it is row-oriented, and it is text; because it is schema-less, it has to store the field names in every row. The second type of file format is Avro. Avro has records, enums, arrays, maps, unions, and a fixed type; Avro has a schema, it is row-oriented, and it is binary and compressed. The third category is the Parquet and ORC style of file formats, which are columnar. Parquet and ORC support arrays, maps, and structs; they have a schema, they are column-oriented, unlike Avro which is row-oriented, they are binary and compressed, and they additionally support very nice compression and encoding types. The main difference between Parquet and ORC is in how they represent complex types: Parquet encodes the complex type hierarchy as repetition and definition levels, whereas ORC uses a separate column at every parent of the complex type to represent the presence of nulls.
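As an illustration of a columnar format with nested types, here is a small sketch that writes restaurant-style records, with an array column and an array-of-struct column, to a Parquet file using PyArrow; the field names and prices are made up for the example:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Schema with nested complex types: a list of strings and a list of structs.
schema = pa.schema([
    ("name", pa.string()),
    ("cuisine", pa.string()),
    ("locations", pa.list_(pa.string())),
    ("menu", pa.list_(pa.struct([("item", pa.string()),
                                 ("price", pa.float64())]))),
])

rows = [
    {"name": "Lily's", "cuisine": "pizza",
     "locations": ["Cambridge", "Pittsburgh"],
     "menu": [{"item": "cheese", "price": 8.50},
              {"item": "pepperoni", "price": 9.50}]},
    {"name": "Bob's Tacos", "cuisine": "mexican",
     "locations": ["Houston"],
     "menu": [{"item": "tortilla", "price": 3.00}]},
]

table = pa.Table.from_pylist(rows, schema=schema)
pq.write_table(table, "restaurants.parquet")   # columnar, schema'd, compressed
print(pq.read_table("restaurants.parquet").schema)
```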
Apart from that difference in how they represent complex types, Parquet and ORC have similar capabilities in terms of optimizations and compression techniques. To summarize: JSON has no schema, has no binary format, and is not columnar; Avro has a schema and a binary format, but is not columnar; and Parquet and ORC have a schema, have a binary format, and are columnar. So let's see how we can query these different kinds of complex types, and the different file formats they can be present in, in Vertica. In Vertica we have a feature called flex tables where you can load complex data types and analyze them. Flex tables use a binary format called VMap to store data as key-value pairs. Flex tables are schema-less, they are weakly typed, and they trade flexibility for performance. What I mean by schema-less is that the keys provide the field names, and each row can potentially have different keys; and they are weakly typed because there is no type information at the column level. We will see some examples of this weak typing in the following slides, but basically there is no type information, and the data is stored in text format. Because of the weak typing and schema-less nature of flex tables, you can trivially implement needs like schema evolution, or keep the complex types fluid; if that is your use case, the weak typing and schema-less nature of flex tables gives you that flexibility. However, because of the weak typing, the downside is that you don't get the best possible performance. If your use case is to get the best possible performance, you can use the new strongly typed complex types that we have started to introduce in Vertica. These complex types have a schema, and they give you the best possible performance, because the optimizer now has enough information from the schema and the types to implement optimizations such as column selection, and all the nice techniques that Vertica employs to give you the best possible columnar performance can now be supported even for complex types. We'll see examples of both approaches in these slides. So let's use a simple data set called restaurants that I'll use throughout these slides to see all the different variations of flex and complex types. On this slide you have some sample data with four fields and essentially two rows. The four fields are name, cuisine, locations, and menu: name and cuisine are of type varchar, locations is essentially an array, and menu is an array of rows with two fields, item and price. If the data is in JSON, there is no schema and no type information, so how do we process that in Vertica? In Vertica you can simply create a flex table called restaurants, copy the restaurants.json file into Vertica, and start analyzing the data. If you do a select star from restaurants, you will see that all the data is actually in one column called __raw__; you also have another column called __identity__, which gives you a unique row ID. The __raw__ column encapsulates all the data from the restaurants JSON file, and this column is nothing but the VMap format.
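Before going deeper into the VMap format, here is a minimal sketch of that flex-table workflow driven from Python via the vertica-python client; the connection details and file path are placeholders, and the SQL mirrors what the talk describes (CREATE FLEX TABLE, COPY with the JSON parser, and inspecting the __raw__ VMap column):

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Schema-less flex table: no columns declared up front.
    cur.execute("CREATE FLEX TABLE restaurants();")
    # Load the JSON file; fjsonparser stores each record as VMap key-value pairs.
    cur.execute("COPY restaurants FROM LOCAL 'restaurants.json' "
                "PARSER fjsonparser();")
    # __raw__ holds the VMap; maptostring() renders it as readable text.
    cur.execute("SELECT maptostring(__raw__) FROM restaurants LIMIT 2;")
    for (row,) in cur.fetchall():
        print(row)
```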
The VMap format is a binary format that encodes the data as key-value pairs, and the __raw__ column is backed by the LONG VARBINARY column type in Vertica. Each key gives you the field name and the value gives you the field value, with values held in a text representation. Now, say you want to get better performance on this JSON data. Flex tables have some nice functions to analyze your data and try to extract schema and type information from it. If you execute compute_flextable_keys on the restaurants table, you will see a new table called public.restaurants_keys that gives you information about your JSON data. It was able to automatically infer that the data has four fields, namely name, cuisine, locations, and menu, and also that name and cuisine are varchar; however, since locations and menu are complex types themselves, one an array and one an array of rows, it uses the same VMap format to process them. So it reports four columns: two primitive columns of type varchar, and two that are VMaps themselves. You can then materialize these columns by altering the table definition and adding columns of the inferred types, and you get better performance from the materialized columns: the data is no longer in a single column, you have four columns for your restaurant data, and you get column selection and the other optimizations that Vertica provides. So flex tables are helpful if you don't have a schema or any type information. However, we saw earlier that some file formats, like Parquet and Avro, have a schema and type information. In those cases, you don't have to do the first step of inferring the types: you can directly create an external table definition with those types, point it at the Parquet file, and read it via an external table in Vertica. If you convert the same restaurants.json to Parquet format, you get the primitive fields typed directly; however, the locations and menu columns are still in the VMap format. The VMap format also allows you to explode the data, and it has some nice functions, such as mapitems, to extract the fields from the VMap format. With the same restaurant data, if you want to explode it and apply predicates on the fields of the arrays and the nested rows, you can use mapitems to explode your data and then apply predicates on a particular field of the complex type. This slide shows how you can explode the entire data, the menu items as well as the locations, and get the elements of each of these complex types. As I mentioned, the locations and menu items are still in the LONG VARBINARY, or VMap, format; so the question is, what if you want better performance on the VMap data?
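Continuing the earlier sketch, this is roughly what the key-inference and materialization steps look like; the restaurants_keys column names follow the flex-table conventions the talk describes, but treat the exact output columns as an assumption:

```python
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "vmart"}   # placeholder credentials

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Scan the VMap data and guess field names and types.
    cur.execute("SELECT compute_flextable_keys('restaurants');")
    cur.execute("SELECT key_name, data_type_guess FROM restaurants_keys;")
    for key_name, type_guess in cur.fetchall():
        # e.g. name/cuisine -> varchar, locations/menu -> long varbinary (VMap)
        print(key_name, type_guess)
    # Promote the inferred keys to real, typed columns for better performance.
    cur.execute("SELECT materialize_flextable_columns('restaurants');")
```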
For primitive types, you can materialize them into primitive columns; but if it's an array or an array of rows, we need first-class complex type constructs, and that is what has now been added in Vertica. Vertica has started to introduce strongly typed complex types. On this slide you have an example of a row complex type: we create an external table called customers with a ROW type of two fields, name and ID, so the complex type is inlined into the column definition. In the second example, you can see create external table items, which has a nested row type: an item of type row with two fields, name and properties, where properties is again another nested row type with two fields, quantity and label. These are strongly typed complex types, and the optimizer can now give you better performance compared to the VMap, using the strong type information in your queries. We have support for pure rows and nested rows in external tables for Parquet, and we have support for arrays and nested arrays as well for external tables in Parquet: you can declare an external table called contacts with a phone numbers column of array of integers, and similarly you can have a nested array of items of type integer and declare a column with that strongly typed complex type. The other complex type support we are adding in the 10.0 release is support for optimized one-dimensional arrays and sets, for both ROS internal tables and Parquet external tables. So you can create an internal table called phone_numbers with a one-dimensional array, here an array of type int. You can have sets as well, which are also one-dimensional collections, but sets are optimized for fast lookups: they have unique elements and they are ordered, so if lookups are your use case, sets give you very quick lookups for elements. We also implemented functions to support arrays and sets, such as apply_min and apply_max, which are scalar functions that you can apply on top of an array to get, for example, the minimum element; there is support for additional functions as well. The other feature coming in 10.0 is the explode-arrays functionality. We have implemented a UDx that, similar to the mapitems example you saw earlier, lets you extract elements from these arrays and apply predicates or analysis on the elements. For example, if you have this restaurants table with a column name of type varchar, locations of type array of varchar, and menu again an array of varchar, you can insert values using the array constructor into these columns. Here we insert three rows: one with a restaurant named Lily's, with locations Cambridge and Pittsburgh and menu items cheese and pepperoni; another with a restaurant named Bob's Tacos, with location Houston and menu items like tortilla and salsa; and a third example row along the same lines. Now you can explode both arrays and extract the elements from them: you can explode the locations array and extract the location elements, which are Houston, Cambridge, Pittsburgh, New Jersey, and you can also explode the menu items and extract the individual elements, and then apply other predicates on the exploded data.
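Pulling those strongly typed examples together, here is an illustrative set of statements; the exact DDL spelling (ROW/ARRAY/SET syntax, file paths, literal values) is an approximation of what the talk describes and may differ slightly from the shipped syntax, so treat it as a sketch rather than reference DDL:

```python
# Illustrative SQL for strongly typed complex columns; run via a Vertica cursor
# as in the earlier sketches. Paths and literal values are made up.
statements = [
    # Inlined ROW type over Parquet files (external table).
    """CREATE EXTERNAL TABLE customers (
           customer ROW(name VARCHAR, id INT)
       ) AS COPY FROM '/data/customers/*.parquet' PARQUET;""",

    # Nested ROW: item has a name plus a properties row of its own.
    """CREATE EXTERNAL TABLE items (
           item ROW(name VARCHAR, properties ROW(quantity INT, label VARCHAR))
       ) AS COPY FROM '/data/items/*.parquet' PARQUET;""",

    # One-dimensional array and set columns in an internal (ROS) table.
    "CREATE TABLE phone_numbers (customer_id INT, numbers ARRAY[INT]);",
    "INSERT INTO phone_numbers VALUES (1, ARRAY[5550100, 5550101]);",
    "SELECT customer_id, APPLY_MIN(numbers), APPLY_MAX(numbers) FROM phone_numbers;",
    "CREATE TABLE tags (doc_id INT, labels SET[VARCHAR]);",  # unique, fast lookups
]
for stmt in statements:
    print(stmt)
```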
So let's look at some usage considerations for these complex data types. Complex data types, as we saw earlier, are nice if you have sparse data: if you have clickstream or page-view data, maps are a very good, space-efficient way to represent it, so for sparse data, use map types. And, as we saw earlier with the web request count query, complex types help you simplify the analysis as well: you don't need joins, and they simplify your query logic. If your use case is fast lookups, you can use the set type; arrays are nice, but they maintain ordering, so if your primary use case is just looking up certain elements, the set type is the better fit. You can also use the VMap, or flex, functionality we have in Vertica if you want flexibility in your complex data type schema. As I mentioned earlier, you can trivially implement needs like schema evolution, or keep the complex types fluid: if you have multiple iterations of your analysis and in each iteration you are changing the fields, because you're still exploring the data, then VMap and flex make it easy to change the fields within the complex type or across files, and you can load fluid complex types, with different fields in different rows, into VMap and flex tables easily. However, once you have iterated over your data and figured out the fields and the complex types you really need, you can use the strongly typed complex data types we have started to introduce in Vertica: the array type, the struct type, and the map type. So that's the high-level picture of the use cases for complex types in Vertica; it depends a lot on where you are in your data analysis. Early on, your data is usually still fluid, and you might want to use VMaps and flex to explore it; once you finalize your schema, you can use the strongly typed complex data types to get the best possible performance. So, what's coming in the following releases of Vertica? In 10.0, which is coming out soon, we are adding support for loading Parquet complex data types into the VMap format. Parquet is a strongly typed file format: it has the schema and the type information for each complex type. However, if you are exploring your data, you might have different Parquet files with different schemas, so you can load them into the VMap format first, analyze your data, and then switch to the strongly typed complex types. We are also adding one-dimensional optimized arrays and sets in ROS and for Parquet, so complex types are not just limited to Parquet; you can also store them in ROS, although right now we only support one-dimensional arrays and sets there. We are also adding the explode UDx for one-dimensional arrays in this release, so, as you saw in the previous example, you can explode the array data and apply predicates on individual elements; it applies to sets as well, since you can cast sets to arrays and explode them too. What are the plans past the 10.0 release? We are going to continue with strongly typed complex types: in the 10.0 release we won't have support for all combinations of complex types, we only have support for nested pure arrays or nested pure rows, and some are limited to the Parquet file format, so we will continue to add more support for subqueries and nested complex types in the following releases. We're also planning to add a VMap data type.
You saw in the examples that the VMap data format is currently backed by the LONG VARBINARY column type. Because of this, the optimizer cannot distinguish which data is actually a plain long varbinary and which is data in the VMap format. The idea is to add a type called VMap, so the optimizer can then implement optimizations, or even syntax such as dot notation. And if your data is columnar, such as Parquet, you can implement optimizations like key push-down, where you push down the keys you are actually querying, so that only those keys are loaded from Parquet and built into the VMap format; that way you get the column selection optimization for complex types as well. That's something you can achieve if you have a distinct type for the VMap format, so that's on the roadmap as well. Unnest join is another nice-to-have feature: right now, if you want to explode and join the array elements, you have to explode in a subquery and then join the data in the outer query; with unnest join, you can explode as well as join the data in the same query, on the fly. Finally, we are also planning to add support for a new feature called UD vector. Our work for complex types is essentially changing the fundamental way Vertica executes functions and expressions. Right now, all expressions in Vertica can return only a single column, except in some cases like UD transforms; for instance, a UD scalar function can return only one column. However, there are use cases where you want multiple computations on the same input data: say you have input data of two integers and you want to compute both addition and multiplication on those two columns. That is a toy example, but many machine learning use cases have similar patterns. If you want to do both computations on the data at the same time, in the current approach you have to have one function for addition and one function for multiplication, and both of them have to load the data, so you load the data twice to get both computations done. With the UD vector support, you can perform both computations in the same function and return two columns, essentially saving the cost of loading these columns twice; you load them once and get both results out. That's what we are trying to enable with the changes we are making to support complex data types in Vertica. And you won't have to use the OVER clause like a UD transform; UD scalars will work just like scalars do today, and you can have multiple columns returned from your computations. So that concludes my talk. Thank you for listening to my presentation; now we are ready for Q&A.
**Summary and sentiment analysis are not shown because of an improper transcript**
ENTITIES
Entity | Category | Confidence |
---|---|---|
America | LOCATION | 0.99+ |
Jeff Healey | PERSON | 0.99+ |
second row | QUANTITY | 0.99+ |
Mary | PERSON | 0.99+ |
two rows | QUANTITY | 0.99+ |
two fields | QUANTITY | 0.99+ |
first row | QUANTITY | 0.99+ |
two rows | QUANTITY | 0.99+ |
two types | QUANTITY | 0.99+ |
each row | QUANTITY | 0.99+ |
two integers | QUANTITY | 0.99+ |
Deepak | PERSON | 0.99+ |
one function | QUANTITY | 0.99+ |
three fields | QUANTITY | 0.99+ |
fourth fields | QUANTITY | 0.99+ |
each element | QUANTITY | 0.99+ |
each field | QUANTITY | 0.99+ |
third | QUANTITY | 0.99+ |
more than thousand web requests | QUANTITY | 0.99+ |
second example | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
each key | QUANTITY | 0.99+ |
each table | QUANTITY | 0.99+ |
four fields | QUANTITY | 0.99+ |
third field | QUANTITY | 0.99+ |
first example | QUANTITY | 0.99+ |
Deepak Magette II | PERSON | 0.99+ |
two columns | QUANTITY | 0.99+ |
third category | QUANTITY | 0.99+ |
two columns | QUANTITY | 0.99+ |
two fields | QUANTITY | 0.99+ |
Houston | LOCATION | 0.99+ |
first step | QUANTITY | 0.99+ |
twice | QUANTITY | 0.99+ |
thousands of columns | QUANTITY | 0.98+ |
three values | QUANTITY | 0.98+ |
this week | DATE | 0.98+ |
more than one thousand website events | QUANTITY | 0.98+ |
third type | QUANTITY | 0.98+ |
each iteration | QUANTITY | 0.98+ |
both | QUANTITY | 0.98+ |
greater than thousand | QUANTITY | 0.98+ |
cambridge | LOCATION | 0.98+ |
JSON | TITLE | 0.98+ |
both arrays | QUANTITY | 0.97+ |
one column | QUANTITY | 0.97+ |
thousands of fields | QUANTITY | 0.97+ |
second | QUANTITY | 0.97+ |
third example | QUANTITY | 0.97+ |
two | QUANTITY | 0.97+ |
single column | QUANTITY | 0.96+ |
thousand | QUANTITY | 0.96+ |
Alex | PERSON | 0.96+ |
first | QUANTITY | 0.96+ |
BBC 2020 | ORGANIZATION | 0.96+ |
Vertica | TITLE | 0.96+ |
four columns | QUANTITY | 0.95+ |
once | QUANTITY | 0.95+ |
one type | QUANTITY | 0.95+ |
V Maps | TITLE | 0.94+ |
one color | QUANTITY | 0.94+ |
second type | QUANTITY | 0.94+ |
one dimension | QUANTITY | 0.94+ |
first two fields | QUANTITY | 0.93+ |
four tables | QUANTITY | 0.91+ |
each | QUANTITY | 0.91+ |
Namik Hrle, IBM | IBM Think 2018
>> Narrator: Live, from Las Vegas, it's theCUBE, covering IBM Think 2018, brought to you by IBM. >> Welcome back to theCUBE. We are live on day one of the inaugural IBM Think 2018 event. I'm Lisa Martin with Dave Vellante, and we are in sunny Vegas at the Mandalay Bay, excited to welcome to theCUBE one of the IBM Fellows, Namik Hrle. Welcome to theCUBE. >> Thank you so much. >> So you are not only an IBM Fellow, but you're also the IBM Analytics technical leadership team chair. Tell us about your role on that technical leadership team. What are some of the things that you're helping to drive? And maybe even give us some of the customer feedback that you're helping to incorporate into IBM's technical direction. >> Okay, so basically, the technical leadership team is a group of top technical leaders in the IBM Analytics group, and we are chartered with evaluating new technologies, providing guidance to our business leaders on what to invest in and what to divest, listening to our customer requirements, listening to how customers are actually using the technology, and making sure that IBM is there in a timely way when it's needed. Also a very important element of the technical leadership team is to promote innovation, innovative activities, particularly grassroots innovative activities: helping our technical leaders across analytics, encouraging them to come up with innovations, to present their ideas, to follow up on those, to potentially turn them into projects, and so on. So that's it. >> And guide them, or just sort of send them off to discover? >> As a matter of fact, we should probably mostly be a sounding board, so it's not necessarily coming from the top down, but trying to encourage them, trying to incite them, trying to make the innovative activity interesting, and at the same time making sure that they see there's something coming out of it. It's not just that they come up with ideas and then nothing happens; we try to turn them into reality by working with our business developers, who, by the way, control the resources needed to do something like that. >> How much of it is guiding folks who want to go down a certain path that maybe you know has been attempted before in that particular way, so you know it's probably better to go elsewhere? Or do you let them go and make the same mistake? Is there any of that? Like, don't go down that, don't go through that door? >> Well, as you can imagine, it's a human temptation to say, well, you know, I've already tried that, already done that, but we really try not to do that. >> Yeah >> We try not to do that, we try to keep an open mind, because in this industry there's always a new set of opportunities and new conditions, and even if we talk about our current topic, like fast data and so on, I believe that many of these things have been around already; we just didn't know how to actually help, how to support something like that. But now, with the new set of knowledge, we can actually do that. >> So, let's get into the fast data. I mean, it wasn't too long ago, we just asked an earlier guest what inning are we at in IoT? He said the third inning. It wasn't long ago we were in the third inning of Hadoop, and everything was batch, and then all of a sudden big data changed, everything became streaming, real-time, fast data. What do you mean by fast data? What is it?
What's the state of fast data inside IBM? >> Well, thank you for that question, because when I was preparing a bit for this interview, of course, I wanted first to make sure that we are all on the same page in terms of what fast data actually means, right? Our industry is full of hype and misunderstanding and everything else. And like many other things and concepts, it's actually not a fundamentally new thing. It's just that the current state of technology, and enhancements in the technology, allow us to do something that we couldn't do before. So, the requirements behind the fast data value proposition were always there, but right now technology actually allows us to derive real-time insight out of the data, irrespective of the data volume, variety, velocity. And when I just said those three V's, it sounds like big data, right? >> Dave: Yeah. >> And, as a matter of fact, there is a pretty large intersection with big data, but there's a huge difference. And the huge difference is that big data is typically associated with data at rest, while fast data is really associated with data in motion. The examples of that particular pattern are all over the place. You can think of clickstreams; you can think about ticker financial data, right? You can think about manufacturing IoT data, sensors, logs. And the spectrum of industries that take advantage of that is all over the place: financial and retail, manufacturing, utilities, all the way to advertising, to agriculture, and everything else. Very often when I talk about fast data, people first jump immediately to, let's say, YouTube streaming, or Facebook and Twitter postings, and everything else. While this is true, and certainly there are business cases built on something like that, what interests me more are the huge cases, like for example Airbus, right? With 10,000 sensors in each of the wings, producing 7 terabytes of information per day, which, by the way, cannot just be dumped somewhere like before and then batch-processed later. You actually have to process that data right there, when it happens, that millisecond, because the ramifications are pretty serious, right? Or take, for example, the opportunity in the utility industry, in power, electricity, where the distributors and manufacturers really entice people to put smart metering in place, so they can measure the consumption of electricity on an hourly basis. And instead of getting one bill yearly, you know the consumption all the time, you can react to spikes, avoid blackouts, and come up with a totally new set of business models in terms of offering special incentives around consumption, adding additional manufacturers; a fantastic set of use cases. I believe Gartner said that by 2020, something like 80% of businesses will have some sort of situational awareness applications, which is another way of describing this kind of capability of event-driven processing. And I agree with that 100%. >> So it's data, fast data is data that is analyzed in real time. >> Namik: Right. >> Such that you can affect an outcome. [Namik] Right. >> Before, what, before something bad happens? Before you lose the buyer? Before-- >> All over the place. You know, before fraud happens in financials, right?
Before a manufacturing line breaks, right? Before something happens with an airplane. So there are many, many examples of something like that, right? And when we talk about it, what we need to understand, again, is that even the technologies needed to deliver fast data value propositions are kind of known technologies. What do you really need? You need a very scalable pub-sub messaging system, like Kafka, for example, in order to acquire the data. Then you need a system which is typically a streaming system, and you have tons of offerings in the open source space, like Apache Spark Streaming, you have Storm, you have Flink, Apache Flink, as well as our IBM Streams, which is really for enterprise quality-of-service delivery. And then, very importantly, and this is something that I hope we will have time to talk about today, you also need to be able to absorb that data: not only do the analytics on the fly, but also store that data and combine that analytics with the historical data. Typically for that, if you read what people suggest, you have lots of open source technology that can do it, like Cassandra, like some HDFS-based systems, and so on. But what I'm saying is, all of them come with this kind of complexity: yes, you can land the data somewhere, but then you need to put it somewhere else in order to do the analytics, and you are basically introducing latency between data production and data consumption. And this is why I believe that a technology like DB2 Event Store, which we announced just yesterday, is something that will become a very powerful part of the whole fast data story.
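As a rough illustration of the acquire-analyze-store stages Namik describes, here is a minimal PySpark Structured Streaming sketch built on the open-source pieces he names (Kafka for pub-sub, Spark for streaming analytics); the broker address, topic, schema, and paths are made-up placeholders, and in the stack he's describing IBM Streams and DB2 Event Store would play these roles:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fast-data-sketch").getOrCreate()

# Hypothetical smart-meter reading: meter id, kWh value, event timestamp.
schema = StructType([
    StructField("meter_id", StringType()),
    StructField("kwh", DoubleType()),
    StructField("ts", TimestampType()),
])

# Acquire: subscribe to the pub-sub topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "meter-readings")
       .load())

events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Analyze on the wire: per-meter one-minute averages, so spikes show up immediately.
per_minute = (events.withWatermark("ts", "1 minute")
              .groupBy(window(col("ts"), "1 minute"), col("meter_id"))
              .agg(avg("kwh").alias("avg_kwh")))

# Store: land results in a columnar format for later historical analysis.
query = (per_minute.writeStream.outputMode("append")
         .format("parquet")
         .option("path", "/data/meter_minutes")
         .option("checkpointLocation", "/data/checkpoints/meter_minutes")
         .start())
query.awaitTermination()
```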
All of these you can apply on the data, on the wire, at wire speed, so you need that kind of enterprise quality of service in terms of applying the analytics on the data that is streaming, and then we come to the DB2 event store, basically a repository for that fire hose data. Where you can put this data in the format in which you can basically, immediately, without any latency between data creation and data consumption, do the analytics on it. That's what we did with our DB2 event store. So, not only that we can ingest, like millions of events per second, literally millions and millions of events per second, but we can also store that in a basically open format, which is tremendous value. Remember, any database system basically in the past stores data in its own format. So you have to use the system that created the data in order to consume that data. >> Dave: Sure. >> What DB2 event store does, is actually, it ingests that data, puts it into a format in which you can use any kind of open source product, like for example, Spark Analytics, to do the analytics on the data. You could use Spark Machine Learning Libraries to do immediately kind of machine learning, modeling as well as scoring, on that data. So, I believe that that particular element of event store, coupled with a tremendous capability to acquire data, is what makes a real differentiation. >> And it does that how? Through a set of API's that allows it to be read? >> So, basically, when the data is coming off the hose, you know, off the streams or something like that, what event store actually does, it puts the data, it's basically an in-memory database right? It puts the data in memory, >> Dave: Something else that's been around forever. >> Exactly, something else yeah. We just have more of it, right? (laughing) And guess what? If it is in memory, it's going to be faster than if it is on disk. What a surprise. >> Yeah. (chuckling) >> So, of course, it puts the data into the memory, and immediately makes it basically available for querying, if you need this data that just came in. But then, kind of asynchronously, offloads the data into basically Apache Parquet format. Into the columnar store. Basically allowing very powerful analytical capabilities immediately on the data. And again, if you like, you can go to the event store to query that data, but you don't have to. You can basically use any kind of tool, like Spark, like Titan or Anaconda Stack, to go after the data and do the analytics on it, to build the models on it, and so on. >> And that asynchronous transformation is fast? >> Asynchronous transformation is such that it gives you this data, which we now call historical data, basically in a minute. >> Dave: Okay. >> So it's kind of like minutes. >> So reasonably low latency. >> But what's very important to understand is that actually the union of that data and the data that is in the memory, which we, by the way, make transparent, can give you 100% what we call kind of almost transactional consistency of your queries against the data that is kind of coming in. So, it's really now a hybrid kind of store, of the memory, in the memory, a very fast log, because we are also logging this data in order to have it for high availability across multiple things, because this is highly scalable, I mean, it's highly what we call a web-scale kind of database. And then Parquet format for the open source storing of the data for historic analysis.
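For readers who want to see the shape of this pipeline in code, here is a minimal sketch of the same pattern using the open source pieces Namik names: Kafka for pub-sub acquisition, Spark Structured Streaming for analytics on the wire, and Parquet as the open storage format for history. This is an illustration only, not IBM's DB2 Event Store implementation; the broker address, topic, schema, and output paths are assumptions, and the Kafka source assumes the spark-sql-kafka connector is available on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fast-data-sketch").getOrCreate()

# Illustrative schema for smart-meter readings arriving on the fire hose.
schema = (StructType()
          .add("meter_id", StringType())
          .add("kwh", DoubleType())
          .add("event_time", TimestampType()))

# 1. Acquire: subscribe to a Kafka topic (scalable pub-sub ingestion).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "smart-meters")
       .load())

readings = (raw
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# 2. Analyze on the wire: hourly consumption per meter, to react to spikes
#    while they are happening instead of in a later batch job.
spikes = (readings
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "1 hour"), "meter_id")
          .agg(avg("kwh").alias("avg_kwh")))

spikes.writeStream.outputMode("update").format("console").start()

# 3. Land the readings in an open columnar format (Parquet) so historical
#    analysis can later use any engine, with no proprietary storage format.
(readings.writeStream
         .format("parquet")
         .option("path", "/data/meters/parquet")
         .option("checkpointLocation", "/data/meters/_checkpoints")
         .start()
         .awaitTermination())
```

The point of the sketch is only to show why an open format matters on the historical side; in Namik's description, DB2 Event Store does the in-memory ingest and the asynchronous Parquet offload inside a single system rather than as hand-stitched stages.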
>> Let's in our last 30 seconds or so, give us some examples, I know this was just announced, but maybe a customer, genericized, in terms of the business benefits that one of the beta customers is achieving leveraging this technology. >> So, in order for customers really to take advantage of all that, as I said, what I would suggest customers to do first of all is to understand where the situation or where these applications actually make sense to them. Where the data is coming in fire hoses, not in the traditional transactional capabilities, but through the fire hose. Where does it come? And then apply these technologies, as I just said. Acquisition of the data, streaming on the wire, analytics, and then DB2 event store as the store of the data. For all that, what you also need, just to tell you, you also need kind of a messaging run time, which is typically products like, for example, Akka technology, and that's why we have also, we have entered also in partnership with Lightbend in order to deliver the entire, kind of experience, for customers that want to build applications that run on fast data. >> So maybe enabling customers to become more proactive maybe predictive, eventually? >> To enable customers to take advantage of this tremendously business relevant data, that is, data that is coming in the, is it the click stream? Is it financial data? Is it IOT data? And to combine it with the assets that they already have, coming from transactions, well, that's a powerful combination. That basically they can build totally brand new business models, as well as enhance existing ones, to something that is going to, you know, improve productivity, for example, or improve the customer satisfaction, or grow the customer segments, and so on and so forth. >> Well, Namik, thank you so much for coming on theCUBE, and sharing the insight of the announcements. It's pretty cool, Dave, I'm sittin' between you, and an IBM Fellow. >> Yeah, that's uh-- >> It's pretty good for a Monday. It's Monday, isn't it? >> Thank you so much. >> Not easy becoming an IBM Fellow, so congratulations on that. >> Thank you so much. >> Lisa: And thanks, again. >> Thank you for having me. >> Lisa: Absolutely, our pleasure. For Dave Vellante, I'm Lisa Martin. We are live at Mandalay Bay in Las Vegas. Nice, sunny day today, where we are on our first day of three days of coverage at IBM Think 2018. Check out our CUBE conversations on thecube.net. Head over to siliconangle.com to find our articles on everything we've done so far at this event and other events, and what we'll be doing for the next few days. Stick around, Dave and I are going to be right back, with our next guest after a short break. (innovative music)
SUMMARY :
covering IBM Think 2018, brought to you by IBM. We are live on day one of the inaugural What are some of the things that you're helping to drive? providing the guidance to our business leaders So, in order to do something like that. before in that particular way, so you know what Well, as you can imagine, it's human attempt to say, and new conditions, and even if you are going to talk So, let's get into the fast data. and enhancements in the technology, allow us to do something of that are all over the place. So it's data, fast data is data that is analyzed Such that you can affect an outcome that yes, you can have land data somewhere, that you have within IBM, there might be, and I believe that the systems like Kafka off that fire hose coming from the cuff, it ingest that data, puts it into the format If it is in memory, it's going to be faster to query that data, but you don't have to. it gives you this data, which we now call that is in the memory on this one, we by the way, that one of the beta customers Acquisition of the data, streaming on the wire, to something that is going to, you know, and sharing the insight of the announcements. It's pretty good for a Monday. so congratulations on that. for the next few days.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Vellante | PERSON | 0.99+ |
Lisa Martin | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Namik | PERSON | 0.99+ |
Namik Hrle | PERSON | 0.99+ |
100% | QUANTITY | 0.99+ |
millions | QUANTITY | 0.99+ |
Lisa | PERSON | 0.99+ |
Las Vegas | LOCATION | 0.99+ |
80% | QUANTITY | 0.99+ |
10,000 sensors | QUANTITY | 0.99+ |
Mandalay Bay | LOCATION | 0.99+ |
Monday | DATE | 0.99+ |
Liebmann | ORGANIZATION | 0.99+ |
siliconangle.com | OTHER | 0.99+ |
2020 | DATE | 0.99+ |
Carter | PERSON | 0.99+ |
three days | QUANTITY | 0.99+ |
third inning | QUANTITY | 0.99+ |
Apache | ORGANIZATION | 0.99+ |
thecube.net | OTHER | 0.99+ |
first | QUANTITY | 0.99+ |
yesterday | DATE | 0.99+ |
today | DATE | 0.99+ |
hundreds | QUANTITY | 0.98+ |
IBM Think 2018 | EVENT | 0.98+ |
ORGANIZATION | 0.98+ | |
Kafka | TITLE | 0.98+ |
Airbus | ORGANIZATION | 0.98+ |
Spark | TITLE | 0.98+ |
Namik | ORGANIZATION | 0.97+ |
YouTube | ORGANIZATION | 0.97+ |
ORGANIZATION | 0.96+ | |
first day | QUANTITY | 0.96+ |
Spark Analytics | TITLE | 0.95+ |
Anaconda Stack | TITLE | 0.95+ |
DB2 | TITLE | 0.95+ |
Titan | TITLE | 0.94+ |
millions of events per second | QUANTITY | 0.94+ |
three V | QUANTITY | 0.92+ |
a minute | QUANTITY | 0.92+ |
millions events per second | QUANTITY | 0.89+ |
day one | QUANTITY | 0.88+ |
Stream | COMMERCIAL_ITEM | 0.86+ |
each | QUANTITY | 0.84+ |
7 terabytes of information | QUANTITY | 0.75+ |
one | QUANTITY | 0.74+ |
Fling | TITLE | 0.71+ |
DB2 | EVENT | 0.66+ |
Storm | TITLE | 0.65+ |
ACCA | ORGANIZATION | 0.64+ |
theCUBE | ORGANIZATION | 0.64+ |
Robert Walsh, ZeniMax | PentahoWorld 2017
>> Announcer: Live from Orlando, Florida it's theCUBE covering Pentaho World 2017. Brought to you by Hitachi Vantara. (upbeat techno music) (coughs) >> Welcome to Day Two of theCUBE's live coverage of Pentaho World, brought to you by Hitachi Vantara. I'm your host Rebecca Knight along with my co-host Dave Vellante. We're joined by Robert Walsh. He is the Technical Director Enterprise Business Intelligence at ZeniMax. Thanks so much for coming on the show. >> Thank you, good morning. >> Good to see ya. >> I should say congratulations is in order (laughs) because your company, ZeniMax, has been awarded the Pentaho Excellence Award for the Big Data category. I want to talk about the award, but first tell us a little bit about ZeniMax. >> Sure, so the company itself, so most people know us by the games versus the company corporate name. We make a lot of games. We're the third biggest company for gaming in America. And we make a lot of games such as Quake, Fallout, Skyrim, Doom. We have a game launching this week called Wolfenstein. And so, most people know us by the games versus the corporate entity which is ZeniMax Media. >> Okay, okay. And as you said, you're the third largest gaming company in the country. So, tell us what you do there. >> So, myself and my team, we are primarily responsible for the ingestion and the evaluation of all the data from the organization. That includes really two main buckets. So, very simplistically we have the business world. So, the traditional money, users, demographics, people, sales. And on the other side we have the game. That's where a lot of people see the fun in what we do, such as what people are doing in the game, where in the game they're doing it, and why they're doing it. So, we get a lot of data on gameplay behavior based on our playerbase. And we try and fuse those two together for the single view of our customer. >> And that data comes from is it the console? Does it come from the ... What's the data flow? >> Yeah, so we actually support many different platforms. So, we have games on the console. So, Microsoft, Sony, PlayStation, Xbox, as well as the PC platform. Macs for example, Android, and iOS. We support all platforms. So, the big challenge that we have is trying to unify that ingestion of data across all these different platforms in a unified way to facilitate downstream the reporting that we do as a company. >> Okay, so who ... When it says you're playing the game on a Microsoft console, whose data is that? Is it the user's data? Is it Microsoft's data? Is it ZeniMax's data? >> I see. So, many games that we actually release have a service-based component. Most of our games are actually an online world. So, if you disconnect today people are still playing in that world. It never ends. So, in that situation, we have all the servers that people connect to from their desktop, from their console. Not all but most data we generate for the game comes from the servers that people connect to. We own those. >> Dave: Oh, okay. >> Which simplifies greatly getting that data from the people. >> Dave: So, it's your data? >> Exactly. >> What is the data telling you these days? >> Oh, wow, depends on the game. I think people realize what people do in games, what games have become. So, we have one game right now called Elder Scrolls Online, and this year we released the ability to buy in-game homes. And you can buy furniture for your in-game homes. So, you can furnish them. People can come and visit. And you can buy items, and weapons, and pets, and skins.
And what's really interesting is part of the reason why we exist is to look at patterns and trends based on how people interact with that environment. So for example, we'll see the American playerbase buy very different items compared to say the European playerbase, based on social differences. And so, that helps immensely for the people who continuously develop the game to add items and features that people want to see and want to leverage. >> That is fascinating that Americans and Europeans are buying different furniture for their online homes. So, just give us some examples of the difference that you're seeing between these two groups. >> So, it's not just the homes, it applies to everything that they purchase as well. It's quite interesting. So, when it comes to the Americans versus Europeans for example what we find is that Europeans prefer much more cosmetic, passive experiences. Whereas the Americans are much more into things that stand out, things that are ... I'm trying to avoid stereotypes right now. >> Right exactly. >> It is what it is. >> Americans like ostentatious stuff. >> Robert: Exactly. >> We get it. >> Europeans are a bit more passive in that regard. And so, we do see that. >> Rebecca: Understated maybe. >> Thank you, that's a much better way of putting it. But games often have to be tweaked based on the environment. A different way of looking at it is a lot of companies in Korea and in Asia take all of these games from the West and they will have to tweak the game completely before it releases in those environments. Because players will behave differently and expect different things. And these games have become global. We have people playing all over the world all at the same time. So, how do you facilitate it? How do you support these different users with different needs in this one environment? Again, that's why BI has grown substantially in the gaming industry in the past five, ten years. >> Can you talk about the evolution of how you've been able to interact and essentially affect the user behavior or response to that behavior. You mentioned BI. So, you know, go back ten years it was very reactive. Not a lot of real time stuff going on. Are you now in the position to affect the behavior in real time, in a positive way? >> We're very close to that. We're not quite there yet. So yes, that's a very good point. So, five, ten years ago most games were traditional boxes. You make a game, you get a box, Walmart or Gamestop, and then you're finished. The relationship with the customer ends. Now, we have this concept that's used often, games as a service. We provide an online environment, a service around a game, and people will play those games for weeks, months, if not years. And so, the shift as well from a BI tech standpoint is one item where we've been able to streamline the ingest process. So, we're not real time but we can be hourly. Which is pretty responsive. But also, the fact that these games have become these online environments has enabled us to get this information. Five years ago, when the game was in a box, on the shelf, there was no connective tissue between us and them to interact and facilitate. With the games now being online, we can leverage BI. We can be more real time. We can respond quicker. But it's also due to the fact that now games themselves have changed to facilitate that interaction. >> Can you, Robert, paint a picture of the data pipeline? We started there with sort of the different devices. And you're bringing those in as sort of a blender.
But take us through the data pipeline and how you're ultimately embedding or operationalizing those analytics. >> Sure. So, the game telemetry, the game and the business information, game telemetry is most likely 90, 95% of our total data footprint. We generate a lot more game information than we do business information. It's just due to how much we can track. We can do so. And so, a lot of these games will generate various game events, game logs that we can ingest into a single data lake. And we can use Amazon S3 for that. But it's not just the game telemetry. So, we have databases for financial information, account users, and so we will ingest the game events as well as the databases into one single location. At that point, however, it's still very raw. It's still very basic. We enable the analysts to actually interact with that. And they can go in there and get their feet wet but it's still very raw. The next step is really taking that raw information that is disjointed and separated, and unifying that into a single model that they can use in a much more performant way. In that first step, the analysts have the burden of a lot of the ETL work, to manipulate the data, to transform it, to make it useful. Which they can do. They should be doing the analysis, not the ingesting of the data. And so, the progression from there into our warehouse is the next step of that pipeline. And so in there, we create these models and structures. And they're often born out of what the analysts are seeing and using in that initial data lake stage. So, if they're repeating analysis, if they're doing this on a regular basis, and the company wants something that's automated and auditable and productionized, then that's a great use case for promotion into our warehouse. You've got this initial staging layer. We have a warehouse where it's structured information. And we allow the analysts into both of those environments. So, they can pick their poison in some respects. Structured data over here, raw and vast over here based on their use case. >> And what are the roles ... Just one more follow up, >> Yeah. >> if I may? Who are the people that are actually doing this work? Building the models, cleaning the data, and storing data. You've got data scientists. You've got quality engineers. You got data engineers. You got application developers. Can you describe the collaboration between those roles? >> Sure. Yeah, so we as a BI organization we have two main groups. We have our engineering team. That's the one I drive. Then we have reporting, and that's a team. Now, we are really one single unit. We work as a team but we separate those two functions. And so, in my organization we have two main groups. We have our big data team which is doing that initial ingestion. Now, we ingest billions of rows of data a day. Terabytes of data a day. And so, we have a team just dedicated to ingestion, standardization, and exposing that first stage. Then we have our second team who are the warehouse engineers, who are actually here today somewhere. And they're the ones who are doing the modeling, the structuring. I mean the data modeling, making the data usable and promoting that into the warehouse. On the reporting team, basically we are there to support them. We provide these tool sets to engage and let them do their work. And so, in that team they have a varied split of people who do a lot of report development, visualization, data science. A lot of the individuals there will do all those three, two of the three, one of the three.
But they do also have segmentation across your day to day reporting, which has to function, as well as the more deep analysis for data science or predictive analysis. >> And that data warehouse is on-prem? Is it in the cloud? >> Good question. Everything that I talked about is all in the cloud. About a year and a half, two years ago, we made the leap into the cloud. We drunk the Kool-Aid. As of Q2 next year at the very latest, we'll be 100% cloud. >> And the database infrastructure is Amazon? >> Correct. We use Amazon for all the BI platforms. >> Redshift or is it... >> Robert: Yes. >> Yeah, okay. >> That's where actually I want to go because you were talking about the architecture. So, I know you've mentioned Amazon Redshift. Cloudera is another one of your solutions providers. And of course, we're here in Pentaho World, Pentaho. You've described Pentaho as the glue. Can you expand on that a little bit? >> Absolutely. So, I've been talking about these two environments, these two worlds, data lake to data warehouse. They're both different in how they're developed, but it's really a single pipeline, as you said. And so, how do we get data from this raw form into this modeled structure? And that's where Pentaho comes into play. That's the glue. It's the glue between these two environments, while they're conceptually very different they provide a singular purpose. But we need a way to unify that pipeline. And so, Pentaho we use very heavily to take this raw information, to transform it, ingest it, and model it into Redshift. And we can automate, we can schedule, we can provide error handling. And so it gives us the framework. And it's self-documenting to be able to track and understand from A to B, from raw to structured, how we do that. And again, Pentaho is allowing us to make that transition. >> Pentaho 8.0 just came out yesterday. >> Hmm, it did? >> What are you most excited about there? Do you see any changes? We keep hearing a lot about the ability to scale here at Pentaho World. >> Exactly. So, there's three things that really appeal to me actually on 8.0. So, things that we're missing that they've actually filled in with this release. So firstly, on the streaming component, from earlier, the real time piece we were missing, we're looking at using Kafka and queuing for a lot of our ingestion purposes. And Pentaho is releasing in this new version the mechanism to connect to that environment. That was good timing. We need that. Also too, to get into more critical detail, the logs that we ingest, the data that we handle, we use Avro and Parquet. When we can. We use JSON, Avro, and Parquet. Pentaho can handle JSON today. Avro, Parquet are coming in 8.0. And then lastly, to your point you made as well is where they're going with their system, they want to go into streaming, into all this information. It's very large and it has to go big. And so, they're adding, again, the ability to add worker nodes and scale horizontally their environment. And that's really a requirement before these other things can come into play. So, those are the things we're looking for. Our data lake can scale on demand. Our Redshift environment can scale on demand. Pentaho has not been able to but with this release they should be able to. And that was something that we've been hoping for for quite some time. >> I wonder if I can get your opinion on something. A little futures-oriented. You have a choice as an organization. You could just roll your own opensource, best of breed opensource tools, and slog through that.
And if you're an internet giant or a huge bank, you can do that. >> Robert: Right. >> You can take tooling like Pentaho which is end to end data pipeline, and this dramatically simplifies things. A lot of the cloud guys, Amazon, Microsoft, I guess to a certain extent Google, they're sort of picking off pieces of the value chain. And they're trying to come up with an as-a-service fully-integrated pipeline. Maybe not best of breed but convenient. How do you see that shaking out generally? And then specifically, is that a challenge for Pentaho from your standpoint? >> So, you're right. That's why they're trying to fill these gaps in their environment. To what Pentaho does and what they're offering, there's no comparison right now. They're not there yet. They're a long way away. >> Dave: You're saying the cloud guys are not there. >> No way. >> Pentaho is just so much more functional. >> Robert: They're not close. >> Okay. >> So, that's the first step. However, though what I've been finding in the cloud, there's lots of benefits from the ease of deployment, the scaling. You need a lot less dev ops support, DBA support. But the tools that they offer right now feel pretty bare bones. They're very generic. They have a place but they're not designed for singular purpose. Redshift is the only real piece of the pipeline that is a true Amazon product, but that came from a company called ParAccel ten years ago. They licensed that from a separate company. >> Dave: What a deal that was for Amazon! (Rebecca and Dave laugh) >> Exactly. And so, we like it because of the functionality ParAccel put in many years ago. Now, they've developed upon that. And it made it easier to deploy. But that's the core reason behind it. Now, we use for our big data environment, we use Databricks. Databricks is a cloud solution. They deploy into Amazon. And so, what I've been finding more and more is companies that are specialized in application or function who have their product support cloud deployment, is to me where it's a sweet middle ground. So, Pentaho is also talking about next year looking at Amazon deployment solutioning for their tool set. So, to me it's not really about going all Amazon. Oh, let's use all Amazon products. They're cheap and cheerful. We can make it work. We can hire ten engineers and hack out a solution. I think what's more applicable is people like Pentaho, whichever people in the industry have the expertise and are specialized in that function, who can allow their products to be deployed in that environment and leverage the Amazon advantages, the Elastic Compute, storage model, the deployment methodology. That is where I see the sweet spot. So, if Pentaho can get to that point, for me that's much more appealing than looking at Amazon trying to build out some things to replace Pentaho x years down the line. >> So, their challenge, if I can summarize, they've got to stay functionally ahead. Which they're way ahead now. They got to maintain that lead. They have to curate best of breed like Spark, for example, from Databricks. >> Right. >> Whatever's next and curate that in a way that is easy to integrate. And then look at the cloud's infrastructure. >> Right. Over the years, these companies have been looking at ways to deploy into a data center easily and efficiently. Now, the cloud is the next option. How do they support and implement into the cloud in a way where we can leverage their tool set but in a way where we can leverage the cloud ecosystem. And that's the gap.
And I think that's what we look for in companies today. And Pentaho is moving towards that. >> And so, that's a lot of good advice for Pentaho? >> I think so. I hope so. Yeah. If they do that, we'll be happy. So, we'll definitely take that. >> Is it Pen-ta-ho or Pent-a-ho? >> You've been saying Pent-a-ho with your British accent! But it is Pen-ta-ho. (laughter) Thank you. >> Dave: Cheap and cheerful, I love it. >> Rebecca: I know -- >> Bless your cotton socks! >> Yes. >> I've had it-- >> Dave: Cord and Bennett. >> Rebecca: Man, okay. Well, thank you so much, Robert. It's been a lot of fun talking to you. >> You're very welcome. >> We will have more from Pen-ta-ho World (laughter) brought to you by Hitachi Vantara just after this. (upbeat techno music)
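To picture the lake-to-warehouse promotion Robert describes, where raw game telemetry lands in S3 and repeated analyst work is promoted into modeled, auditable tables, here is a small hypothetical sketch in PySpark. It is not ZeniMax's code: the bucket names, event fields, and aggregation are assumptions, and in their stack a scheduled Pentaho job, rather than an ad hoc script, would own this step.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, count

spark = SparkSession.builder.appName("lake-to-warehouse-sketch").getOrCreate()

# Stage 1: the raw zone. Game servers drop event logs here; analysts can
# explore it directly, but it is loosely typed and slow to query repeatedly.
events = spark.read.json("s3a://example-game-lake/raw/events/2017/10/")

# Stage 2: once an analysis proves useful and repeatable, promote it into an
# automated, auditable model, e.g. daily in-game purchases by region.
daily_purchases = (events
                   .filter(col("event_type") == "item_purchase")
                   .withColumn("purchase_date", to_date(col("event_time")))
                   .groupBy("purchase_date", "region", "item_category")
                   .agg(count("*").alias("purchases")))

# Write the modeled table to the warehouse zone in a columnar format; loading
# it into Redshift (or scheduling the whole flow in Pentaho) would follow.
(daily_purchases.write
                .mode("overwrite")
                .parquet("s3a://example-game-warehouse/daily_purchases/"))
```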
SUMMARY :
Brought to you by Hitachi Vantara. He is the Technical Director for the Big Data category. Sure, so the company itself, gaming company in the country. And on the other side we have the game. from is it the console? So, the big challenge that Is it the user's data? So, many games that we actually release from the people. And so, that helps examples of the difference So, it's not just the homes, And so, we do see that. We have people playing all over the world affect the user behavior And so, the shift as well of the different devices. We enable the analysts to And what are the roles ... Who are the people that are and promoting that into the warehouse. about is all in the cloud. We use Amazon for all the BI platforms. You've described Pentaho as the glue. And so, Pentaho we use very heavily about the ability to scale the data that we handle And if you're an internet A lot of the cloud So, you're right. Dave: You're saying the Pentaho is just So, that's the first step. of the functionality They have to curate best of breed that is easy to integrate. And that's the gap. So, we'll definitely take that. But it is Pen-ta-ho. It's been a lot of fun talking to you. brought to you by Hitachi
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Dave Vellante | PERSON | 0.99+ |
Rebecca Knight | PERSON | 0.99+ |
Rebecca | PERSON | 0.99+ |
Robert Walsh | PERSON | 0.99+ |
Robert | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Pentaho | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Asia | LOCATION | 0.99+ |
Walmart | ORGANIZATION | 0.99+ |
America | LOCATION | 0.99+ |
ZeniMax Media | ORGANIZATION | 0.99+ |
ZeniMax | ORGANIZATION | 0.99+ |
Power Excel | TITLE | 0.99+ |
second team | QUANTITY | 0.99+ |
ORGANIZATION | 0.99+ | |
two | QUANTITY | 0.99+ |
two main groups | QUANTITY | 0.99+ |
two groups | QUANTITY | 0.99+ |
Wolfenstein | TITLE | 0.99+ |
one | QUANTITY | 0.99+ |
Orlando, Florida | LOCATION | 0.99+ |
Sony | ORGANIZATION | 0.99+ |
two functions | QUANTITY | 0.99+ |
three | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
90, 95% | QUANTITY | 0.99+ |
next year | DATE | 0.99+ |
Kool-Aid | ORGANIZATION | 0.99+ |
100% | QUANTITY | 0.99+ |
iOS | TITLE | 0.99+ |
today | DATE | 0.99+ |
Doom | TITLE | 0.99+ |
yesterday | DATE | 0.99+ |
Hitachi Vantara | ORGANIZATION | 0.99+ |
two main buckets | QUANTITY | 0.98+ |
Gamestop | ORGANIZATION | 0.98+ |
Fallout | TITLE | 0.98+ |
two environments | QUANTITY | 0.98+ |
first step | QUANTITY | 0.98+ |
one item | QUANTITY | 0.98+ |
Five years ago | DATE | 0.98+ |
Android | TITLE | 0.98+ |
one game | QUANTITY | 0.98+ |
Pentaho World | TITLE | 0.98+ |
three things | QUANTITY | 0.98+ |
first stage | QUANTITY | 0.98+ |
Pen-ta-ho World | ORGANIZATION | 0.98+ |
Pentaho Excellence Award | TITLE | 0.98+ |
this year | DATE | 0.98+ |
Nenshad Bardoliwalla & Stephanie McReynolds | BigData NYC 2017
>> Live from midtown Manhattan, it's theCUBE covering Big Data New York City 2017. Brought to you by Silicon Angle Media and its ecosystem sponsors. (upbeat techno music) >> Welcome back, everyone. Live here in New York, Day Three coverage, winding down for three days of wall to wall coverage, theCUBE covering Big Data NYC in conjunction with Strata Data, formerly Strata Hadoop and Hadoop World, all part of the Big Data ecosystem. Our next guest is Nenshad Bardoliwalla, Co-Founder and Chief Product Officer of Paxata, hot start up in the space. A lot of kudos. Of course, they launched on theCUBE in 2013 three years ago when we started theCUBE as a separate event from O'Reilly. So, great to see the success. And Stephanie McReynolds, you've been on multiple times, VP of Marketing at Alation. Welcome back, good to see you guys. >> Thank you. >> Happy to be here. >> So, winding down, so great kind of wrap-up segment here in addition to the partnership that you guys have. So, let's first talk about, before we get to the wrap-up of the show and kind of bring together the week here and kind of summarize everything, tell us about the partnership you guys have. Paxata, you guys have been doing extremely well. Congratulations. Prakash was talking on theCUBE. Great success. You guys worked hard for it. I'm happy for you. But partnering is everything. Ecosystem is everything. Alation, their collaboration with data. That's their ethos. They're very user-centric. >> Nenshad: Yes. >> From the founders. Seemed like a good fit. What's the deal? >> It's a very natural fit between the two companies. When we started down the path of building new information management capabilities it became very clear that the market had a strong need for both finding data, right? What do I actually have? I need an inventory, especially if my data's in Amazon S3, my data is in Azure Blob storage, my data is on-premise in HDFS, my data is in databases, it's all over the place. And I need to be able to find it. And then once I find it, I want to be able to prepare it. And so, one of the things that really drove this partnership was the very common interests that both companies have. And number one, pushing user experience. I love the Alation product. It's very easy to use, it's very intuitive, really it's a delightful thing to work with. And at the same time they also share our interests in working in these hybrid multicloud environments. So, what we've done and what we announced here at Strata is actually this bi-directional integration between the products. You can start in Alation and find a data set that you want to work with, see what collaboration or notes or business metadata people have created and then say, I want to go see this in Paxata. And in a single click you can then actually open it up in Paxata and profile that data. Vice versa you can also be in Paxata and prepare data, and then with a single click push it back, and then everybody who works with Alation actually now has knowledge of where that data is. So, it's a really nice synergy. >> So, you pushed the user data back to Alation, cause that's what they care a lot about, the cataloging and making the user-centric view work. So, you provide, it's almost a flow back and forth. It's a handshake if you will to data. Am I getting that right? >> Yeah, I mean, the idea's to keep the analyst or the user of that data, data scientist, even in some cases a business user, keep them in the flow of their work as much as possible.
But give them the advantage of understanding what others in the organization have done with that data prior, and allow them to transform it, and then share that knowledge back with the rest of the community that might be working with that data. >> John: So, give me an example. I like your Excel spreadsheet concept cause that's obvious. People know what Excel spreadsheet is so. So, it's Excel-like. That's an easy TAM to go after. All Microsoft users might not get that Azure thing. But this one, just take me through a usecase. >> So, I've got a good example. >> Okay, take me through. >> It's very common in a data lake for your data to be compressed. And when data's compressed, to a user it looks like a black box. So, if the data is compressed in Avro or Parquet or it's even like JSON format. A business user has no idea what's in that file. >> John: Yeah. >> So, what we do is we find the file for them. It may have some comments on that file of how that data's been used in past projects that we infer from looking at how others have used that data in Alation. >> John: So, you put metadata around it. >> We put a whole bunch of metadata around it. It might be comments that people have made. It might be >> Annotations, yeah. >> actual observations, annotations. And the great thing that we can do with Paxata is open that Avro file or Parquet file, open it up so that you can actually see the data elements themselves. So, all of a sudden, the business user has access without having to use a command line utility or understand anything about compression, and how you open that file up-- >> John: So, Paxata is spitting out their nuggets of value back to you, you're kind of understanding it, translating it to the user. And they get to do their thing, you get to do your thing, right? >> It's making an Avro or a Parquet file as easy to use as Excel, basically. Which is great, right? >> It's awesome. >> Now, you've enabled >> a whole new class of people who can use that. >> Well, and people just >> Get turned off when it's anything like jargon, or like, "What is that? I'm afraid it's phishing. Click on that and oh!" >> Well, the scary thing is that in a data lake environment, in a lot of cases people don't even label the files with extensions. They're just files. (Stephanie laughs) So, what started-- >> It's like getting your pictures like DS, JPEG. It's like what? >> Exactly. >> Right. >> So, you're talking about unlabeled-- >> If you looked on your laptop, and if you didn't have JPEG or DOC or PPT. Okay, I don't know what this file is. Well, what you have in the data lake environment is that you have thousands of these files that people don't really know what they are. And so, with Alation we have the ability to get all the value around the curation of the metadata, and how people are using that data. But then somebody says, "Okay, but I understand that this file exists. What's in it?" And then with Click to Profile from Alation you're immediately taken into Paxata. And now you're actually looking at what's in that file. So, you can very quickly go from this looks interesting to let me understand what's inside of it. And that's very powerful. >> Talk about Alation. Cause I had the CEO on, also their lead investor Greg Sands from Costanoa Ventures. They're a pretty amazing team but it's kind of out there. No offense, it's kind of a compliment actually. (Stephanie laughs) >> They got a symbolic systems Stanford guy, who's like super-smart. >> Stephanie: Keep going. >> Nenshad: Yeah.
>> They're on something that's really unique but it's almost too simple to be. Like, wait a minute! Google for the data, it's an awesome opportunity. How do you describe Alation to people who say, "Hey, what's this Alation thing?" >> Yeah, so I think that the best way to describe it is it's the browser for all of the distributed data in the enterprise. Sorry, so it's both the catalog, and the browser that sits on top of it. It sounds very simple. Conceptually it's very simple but they have a lot of richness in what they're able to do behind the scenes in terms of introspecting what type of work people are doing with data, and then taking that knowledge and actually surfacing it to the end user. So, for example, they have very powerful scenarios where they can watch what people are doing in different data sources, and then based on that information actually bubble up how queries are being used or the different patterns that people are doing to consume data with. So, what we find really exciting is that this is something that is very complex under the covers. Which Paxata is as well being built upon Spark. But they have put in the hard engineering work so that it looks simple to the end user. And that's the exact same thing that we've tried to do. >> And that's the hard problem. Okay, Stephanie back ... That was a great example by the way. Can't wait to have our little analyst breakdown of the event. But back to Alation for you. So, how do you talk about, you've been VP of Marketing of Alation. But you've been around the block. You know B2B, tech, big data. So, you've seen a bunch of different, you've worked at Trifacta, you worked at other companies, and you've seen a lot of waves of innovation come. What's different about Alation that people might not know about? How do you describe the difference? Because it sounds easy, "Oh, it's a browser! It's a catalog!" But it's really hard. Is it the tech that's the secret? Is it the approach? How do you describe the value of Alation? >> I think what's interesting about Alation is that we're solving a problem that since the dawn of the data warehouse has not been solved. And that is how to help end users really find and understand the data that they need to do their jobs. A lot of our customers talk about this--
>> Not even 10, 15, 20 years ago. >> So, now when we have more self-service access to data, and we can have more exploratory analysis. What data science really introduced and Hadoop introduced was this ability on-demand to be able to create these structures, you have this more iterative world of how you can discover and explore datasets to come to an insight. The only challenge is, without simplifying that process, a business person is still lost, right? >> John: Yeah. >> Still lost in the data. >> So, we simply call that a catalog. But a catalog is much more-- >> Index, catalog, anthology, there's other words for it, right? >> Yeah, but I think it's interesting because like a concept of a catalog is an inventory has been around forever in this space. But the concept of a catalog that learns from other's behavior with that data, this concept of Behavior I/O that Aaron talked about earlier today. The fact that behavior of how people query data as an input and that input then informs a recommendation as an output is very powerful. And that's where all the machine learning and A.I. comes to work. It's hidden underneath that concept of Behavior I/O but that's there real innovation that drives this rich catalog is how can we make active recommendations to a business person who doesn't have to understand the technology but they know how to apply that data to making a decision. >> Yeah, that's key. Behavior and textual information has always been the two fly wheels in analysis whether you're talking search engine or data in general. And I think what I like about the trends here at Big Data NYC this weekend. We've certainly been seeing it at the hundreds of CUBE events we've gone to over the past 12 months and more is that people are using data differently. Not only say differently, there's baselining, foundational things you got to do. But the real innovators have a twist on it that give them an advantage. They see how they can use data. And the trend is collective intelligence of the customer seems to be big. You guys are doing it. You're seeing patterns. You're automating the data. So, it seems to be this fly wheel of some data, get some collective data. What's your thoughts and reactions. Are people getting it? Is this by people doing it by accident on purpose kind of thing? Did people just fell on their head? Or you see, "Oh, I just backed into this?" >> I think that the companies that have emerged as the leaders in the last 15 or 20 years, Google being a great example, Amazon being a great example. These are companies whose entire business models were based on data. They've generated out-sized returns. They are the leaders on the stock market. And I think that many companies have awoken to the fact that data as a monetizable asset to be turned into information either for analysis, to be turned into information for generating new products that can then be resold on the market. The leading edge companies have figured that out, and our adopting technologies like Alation, like Paxata, to get a competitive advantage in the business processes where they know they can make a difference inside of the enterprise. So, I don't think it's a fluke at all. I think that most of these companies are being forced to go down that path because they have been shown the way in terms of the digital giants that are currently ruling the enterprise tech world. >> All right, what's your thoughts on the week this week so far on the big trends? 
What are obvious, obviously A.I., don't need to talk about A.I., but what were the big things that came out of it? And what surprised you that didn't come out from a trends standpoint buzz here at Strata Data and Big Data NYC? What were the big themes that you saw emerge and didn't emerge what was the surprise? Any surprises? >> Basically, we're seeing in general the maturation of the market finally. People are finally realizing that, hey, it's not just about cool technology. It's not about what distribution or package. It's about can you actually drive return on investment? Can you actually drive insights and results from the stack? And so, even the technologists that we were talking with today throughout the course of the show are starting to talk about it's that last mile of making the humans more intelligent about navigating this data, where all the breakthroughs are going to happen. Even in places like IOT, where you think about a lot of automation, and you think about a lot of capability to use deep learning to maybe make some decisions. There's still a lot of human training that goes into that decision-making process and having agency at the edge. And so I think this acknowledgement that there should be balance between human input and what the technology can do is a nice breakthrough that's going to help us get to the next level. >> What's missing? What do you see that people missed that is super-important, that wasn't talked much about? Is there anything that jumps out at you? I'll let you think about it. Nenshad, you have something now. >> Yeah, I would say I completely agree with what Stephanie said which we are seeing the market mature. >> John: Yeah. >> And there is a compelling force to now justify business value for all the investments people have made. The science experiment phase of the big data world is over. People now have to show a return on that investment. I think that being said though, this is my sort of way of being a little more provocative. I still think there's way too much emphasis on data science and not enough emphasis on the average business analyst who's doing work in the Fortune 500. >> It should be kind of the same thing. I mean, with data science you're just more of an advanced analyst maybe. >> Right. But the idea that every person who works with data is suddenly going to understand different types of machine learning models, and what's the right way to do hyper parameter tuning, and other words that I could throw at you to show that I'm smart. (laughter) >> You guys have a vision with the Excel thing. I could see how you see that perspective because you see a future. I just think we're not there yet because I think the data scientists are still handcuffed and hamstrung by the fact that they're doing too much provisioning work, right? >> Yeah. >> To you're point about >> surfacing the insights, it's like the data scientists, "Oh, you own it now!" They become the sysadmin, if you will, for their department. And it's like it's not their job. >> Well, we need to get them out of data preparation, right? >> Yeah, get out of that. >> You shouldn't be a data scientist-- >> Right now, you have two values. You've got the use interface value, which I love, but you guys do the automation. So, I think we're getting there. I see where you're coming from, but still those data sciences have to set the tone for the generation, right? So, it's kind of like you got to get those guys productive. >> And it's not a .. Please go ahead. 
>> I mean, it's somewhat interesting if you look at can the data scientist start to collaborate a little bit more with the common business person? You start to think about it as a little bit of scientific inquiry process. >> John: Yeah. >> Right? >> If you can have more innovators around the table in a common place to discuss what are the insights in this data, and people are bringing business perspective together with machine learning perspective, or the knowledge of the higher algorithms, then maybe you can bring those next leaps forward. >> Great insight. If you want my observations, I use the crazy analogy. Here's my crazy analogy. Years it's been about the engine Model T, the car, the horse and buggy, you know? Now, "We got an engine in the car!" And they got wheels, it's got a chassis. And so, it's about the apparatus of the car. And then it evolved to, "Hey, this thing actually drives. It's transportation." You can actually go from A to B faster than the other guys, and people still think there's a horse and buggy market out there. So, they got to go to that. But now people are crashing. Now, there's an art to driving the car. >> Right. >> So, whether you're a sports car or whatever, this is where the value piece I think hits home is that, people are driving the data now. They're driving the value proposition. So, I think that, to me, the big surprise here is how people aren't getting into the hype cycle. They like the hype in terms of lead gen, and A.I., but they're too busy for the hype. It's like, drive the value. This is not just B.S. either, outcomes. It's like, "I'm busy. I got security. I got app development." >> And I think they're getting smarter about how their valuing data. We're starting to see some economic models, and some ways of putting actual numbers on what impact is this data having today. We do a lot of usage analysis with our customers, and looking at they have a goal to distribute data across more of the organization, and really get people using it in a self-service manner. And from that, you're being able to calculate what actually is the impact. We're not just storing this for insurance policy reasons. >> Yeah, yeah. >> And this cheap-- >> John: It's not some POC. Don't do a POC. All right, so we're going to end the day and the segment on you guys having the last word. I want to phrase it this way. Share an anecdotal story you've heard from a customer, or a prospective customer, that looked at your product, not the joint product but your products each, that blew you away, and that would be a good thing to leave people with. What was the coolest or nicest thing you've heard someone say about Alation and Paxata? >> For me, the coolest thing they said, "This was a social network for nerds. I finally feel like I've found my home." (laughter) >> Data nerds, okay. >> Data nerds. So, if you're a data nerd, you want to network, Alation is the place you want to be. >> So, there is like profiles? And like, you guys have a profile for everybody who comes in? >> Yeah, so the interesting thing is part of our automation, when we go and we index the data sources we also index the people that are accessing those sources. So, you kind of have a leaderboard now of data users, that contract one another in system. >> John: Ooh. >> And at eBay leader was this guy, Caleb, who was their data scientist. And Caleb was famous because everyone in the organization would ask Caleb to prepare data for them. And Caleb was like well known if you were around eBay for awhile. 
>> John: Yeah, he was the master of the domain. >> And then when we turned on, you know, we were indexing tables on teradata as well as their Hadoop implementation. And all of a sudden, there are table structures that are Caleb underscore cussed. Caleb underscore revenue. Caleb underscore ... We're like, "Wow!" Caleb drove a lot of teradata revenue. (Laughs) >> Awesome. >> Paxata, what was the coolest thing someone said about you in terms of being the nicest or coolest most relevant thing? >> So, something that a prospect said earlier this week is that, "I've been hearing in our personal lives about self-driving cars. But seeing your product and where you're going with it I see the path towards self-driving data." And that's really what we need to aspire towards. It's not about spending hours doing prep. It's not about spending hours doing manual inventories. It's about getting to the point that you can automate the usage to get to the outcomes that people are looking for. So, I'm looking forward to self-driving information. Nenshad, thanks so much. Stephanie from Alation. Thanks so much. Congratulations both on your success. And great to see you guys partnering. Big, big community here. And just the beginning. We see the big waves coming, so thanks for sharing perspective. >> Thank you very much. >> And your color commentary on our wrap up segment here for Big Data NYC. This is theCUBE live from New York, wrapping up great three days of coverage here in Manhattan. I'm John Furrier. Thanks for watching. See you next time. (upbeat techo music)
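As a rough illustration of what a "Click to Profile" step has to do under the covers, namely recover structure from an unlabeled, compressed file so a business user can see what is inside, here is a small sketch using pyarrow. It is not Paxata's or Alation's implementation; the file path is an assumption, and Parquet is used because the format carries its own schema in the file footer.

```python
import pyarrow.parquet as pq

# A file in the lake with no extension and no documentation.
path = "/data/lake/part-00017"

# Parquet files carry their own schema, so a profiler can recover column
# names and types before any catalog entry for the file exists.
parquet_file = pq.ParquetFile(path)
print(parquet_file.schema_arrow)       # column names and types
print(parquet_file.metadata.num_rows)  # row count from the file footer

# Sample a few rows so a business user sees real values, not a black box.
sample = parquet_file.read_row_group(0).to_pandas().head(10)
print(sample)
```

For Avro the idea is the same, since Avro files also embed their schema; only the reader library changes.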
SUMMARY :
Brought to you by Silicon Angle Media and Hadoop World, all part of the Big Data ecosystem. in addition to the partnership that you guys have. What's the deal? And so, one of the things that really drove this partnership So, you pushed the user data back to Alation, Yeah, I mean, the idea's to keep the analyst That's an easy TAM to go after. So, if the data is compressed in Avro or Parquet of how that data's been used in past projects It might be comments that people have made. And the great thing that we can do with Paxata And they get to do their thing, as easy to use as Excel, basically. a whole new class of people Click on that and oh!" the files with extensions. It's like getting your pictures like DS, JPEG. is that you have thousands of these files Cause I had the CEO on, also their lead investor Stephanie: Keep going. Google for the data, it's an awesome opportunity. And that's the exact same thing that we've tried to do. And that's the hard problem. What problem hasn't been solved since the data warehouse? the data that you want to use for your analysis. Well, because in the world of data warehousing But like, when you start to get into to the IT team, but they were doing Well, just look at the cost of goods sold for storage. of how you can discover and explore datasets So, we simply call that a catalog. But the concept of a catalog that learns of the customer seems to be big. And I think that many companies have awoken to the fact And what surprised you that didn't come out And so, even the technologists What do you see that people missed the market mature. in the Fortune 500. It should be kind of the same thing. But the idea that every person and hamstrung by the fact that they're doing They become the sysadmin, if you will, So, it's kind of like you got to get those guys productive. And it's not a .. can the data scientist start to collaborate or the knowledge of the higher algorithms, the car, the horse and buggy, you know? So, I think that, to me, the big surprise here is across more of the organization, and the segment on you guys having the last word. For me, the coolest thing they said, Alation is the place you want to be. Yeah, so the interesting thing is if you were around eBay for awhile. And all of a sudden, there are table structures And great to see you guys partnering. See you next time.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Stephanie | PERSON | 0.99+ |
Stephanie McReynolds | PERSON | 0.99+ |
Greg Sands | PERSON | 0.99+ |
John | PERSON | 0.99+ |
Caleb | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
Nenshad | PERSON | 0.99+ |
New York | LOCATION | 0.99+ |
Prakash | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Aaron | PERSON | 0.99+ |
Silicon Angle Media | ORGANIZATION | 0.99+ |
2013 | DATE | 0.99+ |
thousands | QUANTITY | 0.99+ |
Costanoa Ventures | ORGANIZATION | 0.99+ |
Manhattan | LOCATION | 0.99+ |
two companies | QUANTITY | 0.99+ |
both companies | QUANTITY | 0.99+ |
Excel | TITLE | 0.99+ |
Trifacta | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Strata Data | ORGANIZATION | 0.99+ |
Alation | ORGANIZATION | 0.99+ |
Paxata | ORGANIZATION | 0.99+ |
Nenshad Bardoliwalla | PERSON | 0.99+ |
eBay | ORGANIZATION | 0.99+ |
three days | QUANTITY | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
two values | QUANTITY | 0.99+ |
NYC | LOCATION | 0.99+ |
hundreds | QUANTITY | 0.99+ |
Big Data | ORGANIZATION | 0.99+ |
first | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
both | QUANTITY | 0.99+ |
Strata Hadoop | ORGANIZATION | 0.99+ |
Hadoop World | ORGANIZATION | 0.99+ |
earlier this week | DATE | 0.98+ |
Paxata | PERSON | 0.98+ |
today | DATE | 0.98+ |
Day Three | QUANTITY | 0.98+ |
Parquet | TITLE | 0.96+ |
three years ago | DATE | 0.96+ |
Yaron Haviv, iguazio | BigData NYC 2017
>> Announcer: Live from midtown Manhattan, it's theCUBE, covering BigData New York City 2017, brought to you by SiliconANGLE Media and its ecosystem sponsors. >> Okay, welcome back everyone, we're live in New York City, this is theCUBE's coverage of BigData NYC, this is our own event for five years now we've been running it, been at Hadoop World since 2010, it's our eighth year covering the Hadoop World which has evolved into Strata Conference, Strata Hadoop, now called Strata Data, and of course it's bigger than just Strata, it's about big data in NYC, a lot of big players here inside theCUBE, thought leaders, entrepreneurs, and great guests. I'm John Furrier, the cohost this week with Jim Kobielus, who's the lead analyst on our BigData and our Wikibon team. Our next guest is Yaron Haviv, who's with iguazio, he's the founder and CTO, hot startup here at the show, making a lot of waves on their new platform. Welcome to theCUBE, good to see you again, congratulations. >> Yes, thanks, thanks very much. We're happy to be here again. >> You're known in theCUBE community as the guy on Twitter who's always pinging me and Dave and team, saying, "Hey, you know, you guys got to get that right." You really are one of the smartest guys on the network in our community, you're super-smart, your team has got great tech chops, and in the middle of all that is the hottest market which is cloud native, cloud native as it relates to the integration of how apps are being built, and essentially new ways of engineering around these solutions, not just repackaging old stuff, it's really about putting things in a true cloud environment, with an application development, with data at the center of it, you got a whole complex platform you've introduced. So really, really want to dig into this. So before we get into some of my pointed questions I know Jim's got a ton of questions, is give us an update on what's going on so you guys got some news here at the show, let's get to that first. >> So since the last time we spoke, we had tons of news. We're making revenues, we have customers, we've just recently GA'ed, we recently got significant investment from major investors, we raised about $33 million recently from companies like Verizon Ventures, Bosch, you know for IoT, Chicago Mercantile Exchange, which is Dow Jones and other properties, Dell EMC. So pretty broad. >> John: So customers, pretty much. >> Yeah, so that's the interesting thing. Usually you know investors are sort of strategic investors or partners or potential buyers, but here it's essentially our customers that it's so strategic to the business, we want to... >> Let's go with GA of the projects, just get into what's shipping, what's available, what's the general availability, what are you now offering? >> So iguazio is trying to, you know, you alluded to cloud native and all that. Usually when you go to events like Strata and BigData it's nothing to do with cloud native, a lot of hard labor, not really continuous development and integration, it's like continuous hard work, it's continuous hard work. And essentially what we did, we created a data platform which is extremely fast and integrated, you know has all the different forms of state, streaming and events and documents and tables and all that, into a very unique architecture, won't dive into that today. And on top of it we've integrated cloud services like Kubernetes and serverless functionality and others, so we can essentially create a hybrid cloud.
So some of our customers they even deploy portions as an OpEx-based setting in the cloud, and some portions at the edge or in the enterprise as deployed software, or even a prepackaged appliance. So we're the only ones that provide a full hybrid experience. >> John: Is this a SaaS product? >> So it's a software stack, and it could be delivered in three different options. One, if you don't want to mess with the hardware, you can just rent it, and it's deployed in an Equinix facility, we have very strong partnerships with them globally. If you want to have something on-prem, you can get a software reference architecture, you go and deploy it. If you're a telco or an IoT player that wants a manufacturing facility, we have a very small 2U box, four servers, four GPUs, all the analytics tech you could think of. You just put it in the factory instead of like two racks of Hadoop. >> So you're not general purpose, you're just whatever the customer wants to deploy the stack, their flexibility is on them. >> Yeah. Now it is an appliance >> You have a hosting solution? >> It is an appliance even when you deploy it on-prem, it's a bunch of Docker containers inside that you don't even touch them, you don't SSH to the machine. You have APIs and you have UIs, and just like the cloud experience when you go to Amazon, you don't open the Kimono, you know, you just use it. So our experience that's what we're telling customers. No root access problems, no security problems. It's a hardened system. Give us servers, we'll deploy it, and you go through consoles and UIs, >> You don't host anything for anyone? >> We host for some customers, including >> So you do whatever the customer was interested in doing? >> Yes. (laughs) >> So you're flexible, okay. >> We just want to make money. >> You're pretty good, sticking to the product. So on the GA, so here essentially the big data world you mentioned that there's data layers, like data piece. So I got to ask you the question, so pretend I'm an idiot for a second, right. >> Yaron: Okay. >> Okay, yeah. >> No, you're a smart guy. >> What problem are you solving? So we'll just go to the simple. I love what you're doing, I assume you guys are super-smart, which I can say you are, but what's the problem you're solving, what's in it for me? >> Okay, so there are two problems. One is the challenge everyone wants to transform. You know there is this digital transformation mantra. And it means essentially two things. One is, I want to automate my operation environment so I can cut costs and be more competitive. The other one is I want to improve my customer engagement. You know, I want to do mobile apps which are smarter, you know get more direct content to the user, get more targeted functionality, et cetera. These are the two key challenges for every business, any industry, okay? So they go and they deploy Hadoop and Hive and all that stuff, and it takes them two years to productize it. And then they get to the data science bit. And by the time they finish, they understand that this Hadoop thing can only do one thing. It's queries, and reporting and BI, and data warehousing. How do you do actionable insights from that stuff, okay? 'Cause actionable insights means I get information from the mobile app, and then I translate it into some action. I have to enrich the vectors, the machine learning, all those details. And then I need to respond. Hadoop doesn't know how to do it.
So the first generation is people that pulled a lot of stuff into data lake, and started querying it and generating reports. And the boss said >> Low cost data link basically, was what you say. >> Yes, and the boss said, "Okay, what are we going to do with this report? "Is it generating any revenue to the business?" No. The only revenue generation if you take this data >> You're fired, exactly. >> No, not all fired, but now >> John: Look at the budget >> Now they're starting to buy our stuff. So now the point is okay, how can I put all this data, and in the same time generate actions, and also deal with the production aspects of, I want to develop in a beta phase, I want to promote it into production. That's cloud native architectures, okay? Hadoop is not cloud, How do I take a Spark, Zeppelin, you know, a notebook and I turn it into production? There's no way to do that. >> By the way, depending on which cloud you go to, they have a different mechanism and elements for each cloud. >> Yeah, so the cloud providers do address that because they are selling the package, >> Expands all the clouds, yeah. >> Yeah, so cloud providers are starting to have their own offerings which are all proprietary around this is how you would, you know, forget about HDFS, we'll have S3, and we'll have Redshift for you, and we'll have Athena, and again you're starting to consume that into a service. Still doesn't address the continuous analytics challenge that people have. And if you're looking at what we've done with Grab, which is amazing, they started with using Amazon services, S3, Redshift, you know, Kinesis, all that stuff, and it took them about two hours to generate the insights. Now the problem is they want to do driver incentives in real time. So they want to incent the driver to go and make more rides or other things, so they have to analyze the event of the location of the driver, the event of the location of the customers, and just throwing messages back based on analytics. So that's real time analytics, and that's not something that you can do >> They got to build that from scratch right away. I mean they can't do that with the existing. >> No, and Uber invested tons of energy around that and they don't get the same functionality. Another unique feature that we talk about in our PR >> This is for the use case you're talking about, this is the Grab, which is the car >> Grab is the number one ride-sharing in Asia, which is bigger than Uber in Asia, and they're using our platform. By the way, even Uber doesn't really use Hadoop, they use MemSQL for that stuff, so it's not really using open source and all that. But the point is for example, with Uber, when you have a, when they monetize the rides, they do it just based on demand, okay. And with Grab, now what they do, because of the capability that we can intersect tons of data in real time, they can also look at the weather, was there a terror attack or something like that. They don't want to raise the price >> A lot of other data points, could be traffic >> They don't want to raise the price if there was a problem, you know, and all the customers get aggravated. This is actually intersecting data in real time, and no one today can do that in real time beyond what we can do. >> A lot of people have semantic problems with real time, they don't even know what they mean by real time. >> Yaron: Yes. >> The data could be a week old, but they can get it to them in real time. 
>> But every decision, if you think, if you generalize around the problem, okay, and we have slides on that that I explain to customers. Every time I run analytics, I need to look at four types of data. The context, the event, okay, what happened, okay. The second type of data is the previous state. Like I have a car, was it up or down or what's the previous state of that element? The third element is the time aggregation, like, what happened in the last hour, the average temperature, the average, you know, ticker price for the stock, et cetera, okay? And the fourth thing is enriched data, like I have a car ID, but what's the make, what's the model, who's driving it right now. That's secondary data. So every time I run a machine learning task or any decision I have to collect all those four types of data into one vector, it's called a feature vector, and take a decision on that. You take Kafka, it's only the event part, okay, you take MemSQL, it's only the state part, you take Hadoop it's only like historical stuff. How do you assemble and stitch a feature vector? >> Well you talked about complex machine learning pipeline, so clearly, you're talking about a hybrid >> It's a prediction. And actions based on just dumb things, like the car broke and I need to send a garage, I don't need machine learning for that. >> So within your environment then, do you enable the machine learning models to execute across the different data platforms, of which this hybrid environment is composed, and then do you aggregate the results of those model runs into some larger model that drives the real time decision? >> In our solution, everything is a document, so even a picture is a document, a lot of things. So you can essentially throw in a picture, run TensorFlow, embed more features into the document, and then query those features on another platform. So that's really what makes this continuous analytics extremely flexible, so that's what we give customers. The first thing is simplicity. They can now build applications, you know we have tier one now, automotive customer, CIO coming, meeting us. So you know when I have a project, one year, I need to have hired dozens of people, it's hugely complex, you know. Tell us what's the use case, and we'll build a prototype. >> John: All right, well I'm going to >> One week, we gave them a prototype, and he was amazed how in one week we created an application that analyzed all the streams from the data from the cars, did enrichment, did machine learning, and provided predictions. >> Well we're going to have to come in and test you on this, because I'm skeptical, but here's why. >> Everyone is. >> We'll get to that, I mean I'm probably not skeptical but I kind of am because the history is pretty clear. If you look at some of the big ideas out there, like OpenStack. I mean that thing just morphed into a beast. Hadoop was a cost of ownership nightmare as you mentioned early on. So people have been conceptually correct on what they were trying to do, but trying to get it done was always hard, and then it took a long time to kind of figure out the operational model. So how are you different, if I'm going to play the skeptic here? You know, I've heard this before. How are you different than say OpenStack or Hadoop Clusters, 'cause that was a nightmare, cost of ownership, I couldn't get the type of value I needed, lost my budget. Why aren't you the same? >> Okay, that's interesting.
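For readers who want to make the four types of data Yaron lists above concrete, here is a minimal, generic sketch in plain Python. It is not iguazio's API; the lookup tables, field names, and the build_feature_vector helper are invented for illustration, standing in for the state store, rolling aggregation, and enrichment sources a real pipeline would use.

```python
from collections import deque
from statistics import mean

# Hypothetical stand-ins: a state store, an enrichment table, and a rolling window.
previous_state = {"car-17": {"status": "up"}}
car_catalog = {"car-17": {"make": "Acme", "model": "X1"}}
recent_temps = deque([21.0, 21.4, 22.1], maxlen=60)  # time aggregation source

def build_feature_vector(event):
    """Stitch the four kinds of data into one flat feature vector:
    1. the event itself, 2. previous state, 3. a time aggregation, 4. enrichment."""
    car_id = event["car_id"]
    state = previous_state.get(car_id, {})
    enrichment = car_catalog.get(car_id, {})
    return {
        "car_id": car_id,                          # 1. context / event
        "event_temp": event["temp"],
        "prev_status": state.get("status"),        # 2. previous state
        "avg_temp_last_hour": mean(recent_temps),  # 3. time aggregation
        "make": enrichment.get("make"),            # 4. enriched reference data
        "model": enrichment.get("model"),
    }

if __name__ == "__main__":
    # This flat dictionary is what a model or rules engine would score.
    print(build_feature_vector({"car_id": "car-17", "temp": 23.5}))
```

The point of the sketch is only that a single real-time decision has to join an event with prior state, a rolling aggregate, and reference data, which is exactly the stitching step the interview says no single store handles on its own.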
I don't know if you know but I ran a lot of development for OpenStack when I was at Mellanox, and Hadoop, so I patched a lot of those >> So do you agree with what I said? That that was a problem? >> They are extremely complex, yes. And I think one of the things is that first, OpenStack tried to bite off too much, and it's sort of a huge tent, everyone tries to push his agenda. OpenStack is still an infrastructure layer, okay. And also Hadoop is sort of something in between an infrastructure and an application layer, but it was designed 10 years ago, where the problem that Hadoop tried to solve is how do you do web ranking, okay, on tons of batch data. And then the ecosystem evolved into real time, and streaming and machine learning. >> A data warehousing alternative or whatever. >> So it doesn't fit the original model of batch processing, 'cause if an event comes from the car or an IoT device, and you have to do something with it, you need a table with an index. You can't just go and build a huge Parquet file. >> You know, you're talking about complexity >> John: That's why he's different. >> Go ahead. >> So what we've done with our team, after knowing OpenStack and all those >> John: All the scar tissue. >> And all the scar tissues, and my role was also working with all the cloud service providers, so I know their internal architecture, and I worked on SAP HANA and Exadata and all those things, so we learned from the bad experiences, said let's forget about the lower layers, which is what OpenStack is trying to provide, provide you infrastructure as a service. Let's focus on the application, and build from the application all the way to the flash, and the CPU instruction set, and the adapters and the networking, okay. That's what's different. So what we provide is an application and service experience. We don't provide infrastructure. If you go buy VMware and Nutanix, all those offerings, you get infrastructure. Now you go and build with the dozens of dev ops guys all the stack above. You go to Amazon, you get services. Just they're not the most optimized in terms of the implementation because they also have dozens of independent projects that each one takes a VM and starts writing some >> But they're still a good service, but you got to put it together. >> Yeah right. But also the way they implement, because in order for them to scale is that they have a common layer, they found VMs, and then they're starting to build up applications so it's inefficient. And also a lot of it is built on 10-year-old baseline architecture. We've designed it for a very modern architecture, it's all parallel CPUs with 30 cores, you know, flash and NVMe. And so we've avoided a lot of the hardware challenges, and serialization, and just provide an abstraction layer pretty much like a cloud on top. >> Now in terms of abstraction layers in the cloud, they're efficient, and provide a simplification experience for developers. Serverless computing is up and coming, it's an important approach, of course we have the public clouds from AWS and Google and IBM and Microsoft. There's a growing range of serverless computing frameworks for prem-based deployment. I believe you are behind one. Can you talk about what you're doing at iguazio on serverless frameworks for on-prem or public? >> Yes, it's the first time I'm very active in CNCF, the Cloud Native Computing Foundation.
I'm one of the authors of the serverless white paper, which tries to normalize the definitions of all the vendors and come up with a proposal for an interoperable standard. So I spent a lot of energy on that, 'cause we don't want to lock customers to an API. What's unique, by the way, about our solution, we don't have a single proprietary API. We just emulate all the other guys' stuff. We have all the Amazon APIs for data services, like Kinesis, Dynamo, S3, et cetera. We have the open source APIs, like Kafka. So also on the serverless, my agenda is trying to promote that if I'm writing to Azure or AWS or iguazio, I don't need to change my app. I can use any developer tools. So that's my effort there. And we recently, a few weeks ago, we launched our open source project, which is a sort of second generation of something we had before called Nuclio. It's designed for real time >> John: How do you spell that? >> N-U-C-L-I-O. I even have the logo >> He's got a nice slick here. >> It's really fast because it's >> John: Nuclio, so that's open source that you guys just sponsor and it's all code out in the open? >> All the code is in the open, pretty cool, has a lot of innovative ideas on how to do stream processing and best, 'cause the original serverless functionality was designed around web hooks and HTTP, and even many of the open source projects are really designed around HTTP serving. >> I have a question. I'm doing research for Wikibon on the area of serverless, in fact we've recently published a report on serverless, and in terms of hybrid cloud environments, I'm not seeing yet any hybrid serverless clouds that involve public, you know, serverless like AWS Lambda, and private on-prem deployment of serverless. Do you have any customers who are doing that or interested in hybridizing serverless across public and private? >> Of course, and we have some patents I don't want to go into, but the general idea is, what we've done in Nuclio is also the decoupling of the data from the computation, which means that things can sort of be disjoined. You can run a function on a Raspberry Pi, and the data will be in a different place, and those things can sort of move, okay. >> So the persistence has to happen outside the serverless environment, like in the application itself? >> Outside of the function, the function acts as the persistent layer through APIs, okay. And how this data persistence is materialized, that's a separate thing. So you can actually write the same function that will run against Kafka or Kinesis or Private MQ, or HTTP without modifying the function, and ad hoc, through what we call function bindings, you define what's going to be the thing driving the data, or storing the data. So you can actually write the same function that does an ETL job from table one to table two. You don't need to put the table information in the function, which is not the thing that Lambda does. And it's about a hundred times faster than Lambda, we do 400,000 events per second in Nuclio. So if you write your serverless code in Nuclio, it's faster than writing it yourself, because of all those low-level optimizations. >> Yaron, thanks for coming on theCUBE. We want to do a deeper dive, love to have you out in Palo Alto next time you're in town. Let us know when you're in Silicon Valley for sure, we'll make sure we get you on camera for multiple sessions. >> And more information re:Invent. >> Go to re:Invent. We're looking forward to seeing you there.
Love the continuous analytics message, I think continuous integration is going through a massive renaissance right now, you're starting to see new approaches, and I think things that you're doing is exactly along the lines of what the world wants, which is alternatives, innovation, and thanks for sharing on theCUBE. >> Great. >> That's very great. >> This is theCUBE coverage of the hot startups here at BigData NYC, live coverage from New York, after this short break. I'm John Furrier, Jim Kobielus, after this short break.
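As a rough illustration of the serverless pattern discussed in this segment, here is a minimal Nuclio-style Python handler. Nuclio's documented entry point is a handler(context, event) function; the payload shape and field names below are invented for the example, and the actual wiring of triggers and data bindings is done in the function's configuration rather than in the code, so treat this as a sketch rather than a definitive implementation.

```python
import json

# Nuclio invokes handler(context, event) for each incoming event, regardless of
# which trigger (HTTP, a stream, etc.) delivered it.
def handler(context, event):
    # event.body carries the raw payload; assume a JSON-encoded record here.
    record = json.loads(event.body)
    context.logger.info("processing record for %s" % record.get("car_id"))

    # Placeholder transformation step; where the output lands (a table, a stream)
    # would be defined in the function's configuration, not hard-coded here.
    return json.dumps({"car_id": record.get("car_id"), "ok": True})
```

Because the trigger and the data binding live outside the function body, the same handler can in principle be fed by HTTP during development and by a stream in production without modification, which is the portability point Yaron makes about function bindings.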
SUMMARY :
brought to you by SiliconANGLE Media I'm John Furrier, the cohost this week with Jim Kobielus, We're happy to be here again. and in the middle of all that is the hottest market So since the last time we spoke, we had tons of news. Yeah, so that's the interesting thing. and some portions in the edge or in the enterprise all the analytics tech you could think of. So you're not general purpose, you're just Now it is an appliance and just like the cloud experience when you go to Amazon, So I got to ask you the question, which I can say you are, So the first generation is people that basically, was what you say. Yes, and the boss said, and in the same time generate actions, By the way, depending on which cloud you go to, and that's not something that you can do I mean they can't do that with the existing. and they don't get the same functionality. because of the capability that we can intersect and all the customers get aggravated. A lot of people have semantic problems with real time, but they can get it to them in real time. the average temperature, the average, you know, like the car broke and I need to send a garage, So you know when I have a project, an application that analyzed all the streams from the data Well we're going to have to come in and test you on this, but I kind of am because the history is pretty clear. I don't know if you know but I ran a lot of development is how do you do web ranking, okay, and you have to do something with it, and build from the application all the way to the flash, but you got to put it together. it's all parallel CPUs with 30 cores, you know, Now in terms of abstraction layers in the cloud, So also on the serverless, my agenda is trying to promote I even have the logo and even many of the open source projects on the area of serverless, in fact we've recently and the data will be in a different place, So if you write your serverless code in Nuclio, We want to do a deeper dive, love to have you is exactly along the lines of what the world wants, I'm John Furrier, Jim Kobielus, after this short break.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jim Kobielus | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
Bosch | ORGANIZATION | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
John | PERSON | 0.99+ |
John Furrier | PERSON | 0.99+ |
Verizon Ventures | ORGANIZATION | 0.99+ |
Yaron Haviv | PERSON | 0.99+ |
Asia | LOCATION | 0.99+ |
NYC | LOCATION | 0.99+ |
ORGANIZATION | 0.99+ | |
New York City | LOCATION | 0.99+ |
Jim | PERSON | 0.99+ |
Palo Alto | LOCATION | 0.99+ |
30 cores | QUANTITY | 0.99+ |
New York | LOCATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
two years | QUANTITY | 0.99+ |
BigData | ORGANIZATION | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
five years | QUANTITY | 0.99+ |
two problems | QUANTITY | 0.99+ |
Dell EMC | ORGANIZATION | 0.99+ |
Yaron | PERSON | 0.99+ |
One | QUANTITY | 0.99+ |
Dave | PERSON | 0.99+ |
Kafka | TITLE | 0.99+ |
third element | QUANTITY | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.99+ |
Dow Jones | ORGANIZATION | 0.99+ |
two things | QUANTITY | 0.99+ |
two racks | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
Grab | ORGANIZATION | 0.99+ |
Nuclio | TITLE | 0.99+ |
two key challenges | QUANTITY | 0.99+ |
Cloud Native Foundation | ORGANIZATION | 0.99+ |
about $33 million | QUANTITY | 0.99+ |
eighth year | QUANTITY | 0.99+ |
Hadoop | TITLE | 0.98+ |
second type | QUANTITY | 0.98+ |
Lambda | TITLE | 0.98+ |
10 years ago | DATE | 0.98+ |
each cloud | QUANTITY | 0.98+ |
Strata Conference | EVENT | 0.98+ |
Equanix | LOCATION | 0.98+ |
10-year-old | QUANTITY | 0.98+ |
first thing | QUANTITY | 0.98+ |
first generation | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
second generation | QUANTITY | 0.98+ |
Hadoop World | EVENT | 0.98+ |
first time | QUANTITY | 0.98+ |
theCUBE | ORGANIZATION | 0.97+ |
Nutanix | ORGANIZATION | 0.97+ |
MemSQL | TITLE | 0.97+ |
each one | QUANTITY | 0.97+ |
2010 | DATE | 0.97+ |
Kinesis | TITLE | 0.97+ |
SAS | ORGANIZATION | 0.96+ |
Wikibon | ORGANIZATION | 0.96+ |
Chicago Mercantile Exchange | ORGANIZATION | 0.96+ |
about two hours | QUANTITY | 0.96+ |
this week | DATE | 0.96+ |
one thing | QUANTITY | 0.95+ |
dozen | QUANTITY | 0.95+ |
Christian Rodatus, Datameer | BigData NYC 2017
>> Announcer: Live from Midtown Manhattan, it's theCUBE covering Big Data New York City 2017. Brought to you by SiliconANGLE Media and its ecosystem sponsors. >> This is theCUBE's coverage in New York City for Big Data NYC, the hashtag is BigDataNYC. This is our fifth year doing our own event in conjunction with Strata Hadoop, now called Strata Data, used to be Hadoop World, our eighth year covering the industry, we've been there from the beginning in 2010, the beginning of this revolution. I'm John Furrier, the co-host, with Jim Kobielus, our lead analyst at Wikibon. Our next guest is Christian Rodatus, who is the CEO of Datameer. Datameer, obviously, one of the startups now evolving on the, I think, eighth year or so, roughly seven or eight years old. Great customer base, been successful blocking and tackling, just doing good business. Your shirt says show him the data. Welcome to theCUBE, Christian, appreciate it. >> So well established, I barely think of you as a startup anymore. >> It's kind of true, and actually a couple of months ago, after I took on the job, I met Mike Olson, and Datameer and Cloudera were sort of founded the same year, I believe late 2009, early 2010. Then, he told me there were two open source projects with MapReduce and Hadoop, basically, and Datameer was founded to actually enable customers to do something with it, as an entry platform to help getting data in, create the data and doing something with it. And now, if you walk the show floor, it's a completely different landscape now. >> We've had you guys on before, the founder, Stefan, has been on. Interesting migration, we've seen you guys grow from a customer base standpoint. You've come on as the CEO to kind of take it to the next level. Give us an update on what's going on at Datameer. Obviously, the shirt says "Show me the data." Show me the money kind of play there, I get that. That's where the money is, the data is where the action is. Real solutions, not pie in the sky, we're now in our eighth year of this market, so there's not a lot of tolerance for hype even though there's a lot of AI washing going on. What's going on with you guys? >> I would say, interestingly enough, I met with a customer, a prospective customer, this morning, and this was a very typical organization. So, this is a customer that was an insurance company, and they're just about to spin up their first Hadoop cluster to actually work on customer management applications. And they are overwhelmed with what the market offers now. There are 27 open source projects, there are dozens and dozens of other different tools that try to basically, they try best-of-breed approaches at certain layers of the stack for specific applications, and they don't really know how to stitch this all together. And if I reflect on a customer meeting at a Canadian bank recently that has very successfully deployed applications on the data lake, like in fraud management and compliance applications and things like this, they still struggle to basically replicate the same performance and the service level agreements that they're used to from their old EDW that they still have in production. And so, everybody's now going out there and trying to figure out how to get value out of the data lake for the business users, right? There's a lot of approaches that these companies are trying. There's SQL-on-Hadoop that supposedly doesn't perform properly.
There are other solutions, like OLAP on Hadoop, that try to emulate what they've been used to from the EDWs, and we believe these are the wrong approaches, so we want to stay true to the stack and be native to the stack and offer a platform that really operates end-to-end from ingesting the data into the data lake to creation, preparation of the data, and ultimately, building the data pipelines for the business users, and this is certainly something-- >> Here's more of a play for the business users now, not the data scientists and statistical modelers. I thought the data scientists were your core market. Is that not true? >> So, our primary user base at Datameer used to be like, until last week, we were the data engineers in the companies, or basically the people that built the data lake, that created the data and built these data pipelines for the business user community no matter what tool they were using. >> Jim, I want to get your thoughts on this for Christian's interest. Last year, so these guys can fix your microphone. I think you guys fix the microphone for us, his earpiece there, but I want to get a question to Chris, and I ask to redirect through you. Gartner, another analyst firm. >> Jim: I've heard of 'em. >> Not a big fan personally, but you know. >> Jim: They're still in business? >> The magic quadrant, they use that tool. Anyway, they had a good intro stat. Last year, they predicted through 2017, 60% of big data projects will fail. So, the question for both you guys is did that actually happen? I don't think it did, I'm not hearing that 60% have failed, but we are seeing the struggle around analytics and scaling analytics in a way that's like a dev ops mentality. So, thoughts on this 60% data projects fail. >> I don't know whether it's 60%, there was another statistic that said only 14% of Hadoop deployments are in production or something, >> They said 60, six zero. >> Or whatever. >> Define failure, I mean, you've built a data lake, and maybe you're not using it immediately for any particular application. Does that mean you've failed, or does it simply mean you haven't found the killer application yet for it? I don't know, your thoughts. >> I agree with you, it's probably not a failure to that extent. It's more like how do they, so they dump the data into it, right, they build the infrastructure, now it's about the next step data lake 2.0 to figure out how do I get value out of the data, how do I go after the right applications, how do I build a platform and tools that basically promote the use of that data throughout the business community in a meaningful way. >> Okay, so what's going on with you guys from a product standpoint? You guys have some announcements. Let's get to some of the latest and greatest. >> Absolutely. I think we were very strong in data creation, data preparation and the entire data governance around it, and we are using, as a user interface, we are using this spreadsheet-like user interface called a workbook, it really looks like Excel, but it's not. It operates at a completely different scale. It's basically an Excel spreadsheet on steroids. Our customers build data pipelines with it, so this is the data engineers that we discussed before, but we also have a relatively small power user community in our client base that use that spreadsheet for deep data exploration.
Now, we are lifting this to the next level, and we put up a visualization layer on top of it that runs natively in the stack, and what you get is basically a visual experience not only in the data curation process but also in deep data exploration, and this is combined with two platform technologies that we use, it's based on highly scalable distributed search in the backend engine of our product, number one. We have also adopted a columnar data store, Parquet, for our file system now. In this combination, the data exploration capabilities we bring to the market will allow power analysts to really dig deep into the data, so there are literally no limits in terms of the breadth and the depth of the data. It could be billions of rows, it could be thousands of different attributes and columns that you are looking at, and you will get a response time of sub-second as we create indices on demand as we run this through the analytic process. >> With these fast queries and visualization, do you also have the ability to do semantic data virtualization roll-ups across multi-cloud or multi-cluster? >> Yeah, absolutely. Also, there's a second trend that we discussed right before we started the live transmission here. Things are also moving into the cloud, so what we are seeing right now is the EDW's not going away, the on-prem data lakes prevail, right, and now they are thinking about moving certain workload types into the cloud, and we understand ourselves as a platform play that builds a data fabric that really ties all these data assets together, and it enables the business. >> On the trends, we weren't on camera, we'll bring it up here, the impact of cloud on the data world. You've seen this movie before, you have extensive experience in this space going back to the origination, your days at Teradata. When it was the classic, old-school data warehouse. And then, great purpose, great growth, massive value creation. Enter the Hadoop kind of disruption. Hadoop evolved from batch to do ranking stuff, and then tried to, it was a hammer that turned into a lawnmower, right? Then they started going down the path, and really, it wasn't workable for what people were looking at, but everyone was still trying to be the Teradata of whatever. Fast forward, so things have evolved and things are starting to shake out, same picture of data warehouse-like stuff, now you got cloud. It seems to be changing the nature of what it will become in the future. What's your perspective on that evolution? What's different about now and what's the same about now, from the old days? What's the similarities of the old-school, and what's different that people are missing? >> I think it's a lot related to cloud, just in general. It is extremely important to fast adoption throughout the organization, to get performance and service-level agreements with our customers. This is where we clearly can help, and we give them a user experience that is meaningful and that resembles what they were used to from the old EDW world, right? That's number one. Number two, and this comes back to the question of the 60% that fail, or why is it failing or working. I think there are a lot of really interesting projects out there, and our customers are betting big time on the data lake projects whether it be on premise or in the cloud. And we work with HSBC, for instance, in the United Kingdom. They've got 32 data lake projects throughout the organization, and I spoke to one of these-- >> Not 32 data lakes, 32 projects that involve tapping into the data lake.
>> 32 projects that involve various data lakes. >> Okay. (chuckling) >> And I spoke to one of the chief data officers there, and they said their data center infrastructure, just by having kick-started these projects, will explode. And they're not in the business of operating all the hardware and things like this, and so, a major bank like them, they made an announcement recently, a public announcement, you can read about it, started moving the data assets into the cloud. This is clearly happening at a rapid pace, and it will change the paradigm in terms of breathability and being able to satisfy peak workload requirements as they come up, when you run a compliance report at quarter-end or something like this, so this will certainly help with adoption and creating business value for our customers. >> We talk all the time about real-time, and there's so many examples of how data science has changed the game. I mean, I was talking about, from a cyber perspective, how data science helped capture Bin Laden to how I can get increased sales to better user experience on devices. Having real-time access to data, and you put in some quick data science around things, really helps things at the edge. What's your view on real-time? Obviously, that's super important, you got to kind of get your house in order in terms of base data hygiene and foundational work, building blocks. At the end of the day, the real-time seems to be super hot right now. >> Real-time is a relative term, right, so there are certainly applications, like IoT applications, or machine data that you analyze, that require real-time access. I would call it right-time, so what's the increment of data load that is required for certain applications? We are certainly not a real-time application yet. We can possibly load data through Kafka and stream data through Kafka, but in general, we are still a batch-oriented platform. We can do--
You need to have at least three to four different products to be able to do what we do, but then you get security breaks, you get lack of data lineage and data governance through the process, and this is the biggest value that we can bring to the table. And secondly now with visual exploration, we offer capability that literally nobody has in the marketplace, where we give power users the capability to explore with blazing fast response times, billion rows of data in a very free-form type of exploration process. >> Are there more power users now than there were when you started as a company? It seemed like tools like Datameer have brought people into the sort of power user camp, just simply by the virtue of having access to your tool. What are your thoughts there? >> Absolutely, it's definitely growing, and you see also different companies exploiting their capability in different ways. You might find insurance or financial services customers that have a very sophisticated capability building in that area, and you might see 1,000 to 2,000 users that do deep data exploration, and other companies are starting out with a couple of dozen and then evolving it as they go. >> Christian, I got to ask you as the new CEO of Datameer, obviously going to the next level, you guys have been successful. We were commenting yesterday on theCUBE about, we've been covering this for eight years in depth in terms of CUBE coverage, we've seen the waves come and go of hype, but now there's not a lot of tolerance for hype. You guys are one of the companies, I will say, that stay to your knitting, you didn't overplay your hand. You've certainly rode the hype like everyone else did, but your solution is very specific on value, and so, you didn't overplay your hand, the company didn't really overplay their hand, in my opinion. But now, there's really the hand is value. >> Absolutely. >> As the new CEO, you got to kind of put a little shiny new toy on there, and you know, rub the, keep the car lookin' shiny and everything looking good with cutting edge stuff, the same time scaling up what's been working. The question is what are you doubling down on, and what are you investing in to keep that innovation going? >> There's really three things, and you're very much right, so this has become a mature company. We've grown with our customer base, our enterprise features and capabilities are second to none in the marketplace, this is what our customers achieve, and now, the three investment areas that we are putting together and where we are doubling down is really visual exploration as I outlined before. Number two, hybrid cloud architectures, we don't believe the customers move their entire stack right into the cloud. There's a few that are going to do this and that are looking into these things, but we will, we believe in the idea that they will still have to EDW their on premise data lake and some workload capabilities in the cloud which will be growing, so this is investment area number two. Number three is the entire concept of data curation for machine learning. This is something where we've released a plug-in earlier in the year for TensorFlow where we can basically build data pipelines for machine learning applications. This is still very small. We see some interest from customers, but it's growing interest. >> It's a directionally correct kind of vector, you're looking and say, it's a good sign, let's kick the tires on that and play around. >> Absolutely. >> 'Cause machine learning's got to learn, too. 
You got to learn from somewhere. >> And quite frankly, deep learning, machine learning tools for the rest of us, there aren't really all that many for the rest of us power users, they're going to have to come along and get really super visual in terms of enabling visual modular development and tuning of these models. What are your thoughts there in terms of going forward about a visualization layer to make machine learning and deep learning developers more productive? >> That is an area where we will not engage in a way. We will stick with our platform play where we focus on building the data pipelines into those tools. >> Jim: Gotcha. >> The last area where we invest is ecosystem integration, so we think our visual explorer backend, which is built on search and on a Parquet file format, or columnar store, is really a key differentiator in feeding or building data pipelines into the incumbent BI ecosystems and accelerating those as well. We currently have prototypes running where we can basically give the same performance and depth of analytic capability to some of the existing BI tools that are out there. >> What are some of the ecosystem partners you guys have? I know partnering is a big part of what you guys have done. Can you name a few? >> I mean, the biggest one-- >> Everybody, Switzerland. >> No, not really. We are focused on staying true to our stack and how we can provide value to our customers, so we work actively and, very important, with Microsoft and Amazon AWS in evolving our cloud strategy. We've started working with various BI vendors throughout that you know about, right, and we definitely have a play also with some of the big SIs and IBM is a more popular one. >> So, BI guys mostly on the tool visualization side. You said you were a pipeline. >> On tool and visualization side, right. We have very effective integration for our data pipelines into the BI tools; today we support TDE for Tableau, we have a native integration. >> Why compete there, just be a service provider. >> Absolutely, and we have more and better technology coming up to even accelerate those tools as well in our big data stuff. >> You're focused, you're scaling, final word I'll give to you for the segment. Share with the folks that are a Datameer customer or have not yet become a customer, what's the outlook, what's the new Datameer look like under your leadership? What should they expect? >> Yeah, absolutely, so I think they can expect utmost predictability, the way we roll out the vision and how we build our product in the next couple of releases. The next five, six months are critical for us. We have launched Visual Explorer here at the conference. We're going to launch our native cloud solution probably middle of November to the customer base. So, these are the big milestones that will help us for our next fiscal year and provide really great value to our customers, and that's what they can expect, predictability, a very solid product, all the enterprise-grade features they need and require for what they do. And if you look at it, we are really an enterprise play, and the customer base that we have is very demanding and challenging, and we want to keep up and deliver a capability that is relevant for them and helps them create value from the data lakes. >> Christian Rodatus, technology enthusiast, passionate, now CEO of Datameer. Great to have you on theCUBE, thanks for sharing. >> Thanks so much.
Datameer here inside theCUBE live coverage, hashtag BigDataNYC, our fifth year doing our own event here in conjunction with Strata Data, formerly Strata Hadoop, Hadoop World, eight years covering this space. I'm John Furrier with Jim Kobielus here inside theCUBE. More after this short break. >> Christian: Thank you. (upbeat electronic music)
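For context on Christian's point about adopting Parquet as a columnar store, here is a small, generic sketch using the pyarrow library; the library choice, file name, and columns are assumptions made for the example and are not a statement about Datameer's internals. The idea is simply that a columnar file lets an exploration query read only the columns it touches instead of scanning whole rows.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table to Parquet; on disk the values are laid out column by column.
table = pa.table({
    "customer_id": [1, 2, 3, 4],
    "country": ["DE", "US", "US", "FR"],
    "revenue": [120.0, 80.5, 43.0, 99.9],
    "notes": ["a", "b", "c", "d"],   # stands in for many wide attribute columns
})
pq.write_table(table, "customers.parquet")

# Read back only the columns a query needs; the other columns are skipped entirely.
subset = pq.read_table("customers.parquet", columns=["country", "revenue"])
print(subset.schema)      # only the two requested columns come back
print(subset.to_pydict())
```

At a scale of billions of rows and thousands of attributes, that column pruning, combined with on-demand indexing, is what makes interactive, sub-second exploration plausible.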
SUMMARY :
Brought to by SiliconANGLE Media and its ecosystem sponsors. I'm John Furrier, the co-host, with Jim Kobielus, So well established, I barely think of you create the data and doing something with it. You've come on as the CEO to kind of and the service level agreements that they used Here's more of a play for the business users now, that created the data and built these data pipelines and I ask to redirect through you. So, the question for both you guys is the killer application yet for it? the next step data lake 2.0 to figure out Okay, so what's going on with you guys and columns that you are looking at, and we understand ourselves as a platform play the impact of cloud to the data world. and that resembles what they were used to tapping into the data lake. and being able to satisfy peak workload requirements and you put in some quick data science around things, or machine data that you analyze Which, by the way, is not going away any time soon. more streaming types of capability as we move this forward. What do the customer architectures look like? and the stack's only going to get more robust. and analyzing the data. just simply by the virtue of having access to your tool. and you see also different companies and so, you didn't overplay your hand, the company and what are you investing in to keep that innovation going? and now, the three investment areas let's kick the tires on that and play around. You got to learn from somewhere. for the rest of us power users, We will stick with our platform play and depth of analytic capability to some of What are some the ecosystem partners do you guys have? and how we can provide value to our customers, on the tool visualization side. into the BI tools today we support TD for Tableau, Absolutely, and we have more and better technology Share with the folks that are a Datameer customer and the customer base that we have is Great to have you on theCUBE, here in conjunction with Strata Data, Christian: Thank you.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Jim Kobielus | PERSON | 0.99+ |
Chris | PERSON | 0.99+ |
HSBC | ORGANIZATION | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
Jim | PERSON | 0.99+ |
Christian Rodatus | PERSON | 0.99+ |
Stefan | PERSON | 0.99+ |
IBM | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
60% | QUANTITY | 0.99+ |
2017 | DATE | 0.99+ |
Datameer | ORGANIZATION | 0.99+ |
2010 | DATE | 0.99+ |
32 projects | QUANTITY | 0.99+ |
Last year | DATE | 0.99+ |
United Kingdom | LOCATION | 0.99+ |
1,000 | QUANTITY | 0.99+ |
New York City | LOCATION | 0.99+ |
14% | QUANTITY | 0.99+ |
eight years | QUANTITY | 0.99+ |
fifth year | QUANTITY | 0.99+ |
one | QUANTITY | 0.99+ |
Cloudera | ORGANIZATION | 0.99+ |
Excel | TITLE | 0.99+ |
eighth year | QUANTITY | 0.99+ |
late 2009 | DATE | 0.99+ |
early 2010 | DATE | 0.99+ |
Mike Olson | PERSON | 0.99+ |
60 | QUANTITY | 0.99+ |
27 open source projects | QUANTITY | 0.99+ |
last week | DATE | 0.99+ |
thousands | QUANTITY | 0.99+ |
two things | QUANTITY | 0.99+ |
Kafka | TITLE | 0.99+ |
seven | QUANTITY | 0.99+ |
second trend | QUANTITY | 0.99+ |
Midtown Manhattan | LOCATION | 0.99+ |
yesterday | DATE | 0.99+ |
Christian | PERSON | 0.99+ |
both | QUANTITY | 0.99+ |
SiliconANGLE Media | ORGANIZATION | 0.98+ |
two open source projects | QUANTITY | 0.98+ |
Gartner | ORGANIZATION | 0.98+ |
two platform technologies | QUANTITY | 0.98+ |
Wikibon | ORGANIZATION | 0.98+ |
Switzerland | LOCATION | 0.98+ |
billions of rows | QUANTITY | 0.98+ |
first | QUANTITY | 0.98+ |
MapReduce | ORGANIZATION | 0.98+ |
2,000 users | QUANTITY | 0.98+ |
Bin Laden | PERSON | 0.98+ |
NYC | LOCATION | 0.97+ |
Strata Data | ORGANIZATION | 0.97+ |
32 data lakes | QUANTITY | 0.97+ |
six | QUANTITY | 0.97+ |
Hadoop | TITLE | 0.97+ |
secondly | QUANTITY | 0.96+ |
next fiscal year | DATE | 0.96+ |
three things | QUANTITY | 0.96+ |
today | DATE | 0.95+ |
four different products | QUANTITY | 0.95+ |
Teradata | ORGANIZATION | 0.95+ |
Christian | ORGANIZATION | 0.95+ |
this morning | DATE | 0.95+ |
TD | ORGANIZATION | 0.94+ |
EDW | ORGANIZATION | 0.94+ |
BigData | EVENT | 0.92+ |
Josh Klahr & Prashanthi Paty | DataWorks Summit 2017
>> Announcer: Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017. Brought to you by Hortonworks. >> Hey, welcome back to theCUBE. Day two of the DataWorks Summit, I'm Lisa Martin with my cohost, George Gilbert. We've had a great day and a half so far, learning a ton in this hyper-growth, big data world meets IoT, machine learning, data science. George and I are excited to welcome our next guests. We have Josh Klahr, the VP of Product Management from AtScale. Welcome George, welcome back. >> Thank you. >> And we have Prashanthi Paty, the Head of Data Engineering for GoDaddy. Welcome to theCUBE. >> Thank you. >> Great to have you guys here. So, wanted to kind of talk to you guys about, one, how you guys are working together, but two, also some of the trends that you guys are seeing. So as we talked about, in the tech industry, it's two degrees of Kevin Bacon, right. You guys worked together back in the day at Yahoo. Talk to us about what you both visualized and experienced in terms of the Hadoop adoption maturity cycle. >> Sure. >> You want to start, Josh? >> Yeah, I'll start, and you can chime in and correct me. But yeah, as you mentioned, Prashanthi and I worked together at Yahoo. It feels like a long time ago. In our central data group. And we had two main jobs. First job was, collect all of the data from our ad systems, our audience systems, and stick that data into a Hadoop cluster. At the time, we were kind of doing it while Hadoop was kind of being developed. And the other thing that we did was, we had to support a bunch of BI consumers. So we built cubes, we built data marts, we used MicroStrategy, Tableau, and I would say the experience there was a great experience with Hadoop in terms of the ability to have low-cost storage, scale out data processing of all of, what were really, billions and billions, tens of billions of events a day. But when it came to BI, it felt like we were doing stuff the old way. And we were moving data off cluster, and making it small. In fact, you did a lot of that. >> Well, yeah, at the end of the day, we were using Hadoop as a staging layer. So we would process a whole bunch of data there, and then we would scale it back, and move it into, again, relational stores or cubes, because basically we couldn't afford to give any accessibility to BI tools or to our end users directly on Hadoop. So while we surely did a large-scale data processing in Hadoop layer, we failed to turn on the insights right there. >> Lisa: Okay. >> Maybe there's a lesson in there for folks who are getting slightly more mature versions of Hadoop now, but can learn from also some of the experiences you've had. Were there issues in terms of, having cleaned and curated data, were there issues for BI with performance and the lack of proper file formats like Parquet? What was it that where you hit the wall? >> It was both, you have to remember this, we were probably one of the first teams to put a data warehouse on Hadoop. So we were dealing with Pig versions of like, 0.5, 0.6, so we were putting a lot of demand on the tooling and the infrastructure. Hadoop was still in a very nascent stage at that time. That was one. And I think a lot of the focus was on, hey, now we have the ability to do clickstream analytics at scale, right. So we did a lot of the backend stuff. But the presentation is where I think we struggled. 
>> So would that mean that you did do, the idea is that you could do full resolution without sampling on the backend, and then you would extract and presumably sort of denormalize so that you could, essentially run data marts for subject matter interests. >> Yeah, and that's exactly what we did is, we took all of this big data, but to make it work for BI, which meant two things, one was performance. It was really, can you get an interactive query and response time. And the other thing was the interface. Can a Tableau user connect and understand what they're looking at? You had to make the data small again. And that was actually the genesis of AtScale, which is where I am today, was, we were frustrated with this, big data platform and having to then make the data small again in order to support BI. >> That's a great transition, Josh. Let's actually talk about AtScale. You guys saw BI on Hadoop as this big white space. How have you succeeded there, and then let's talk about what GoDaddy is doing with AtScale and big data. >> Yeah, I think that we definitely learned, we took the learnings from our experience at Yahoo, and we really thought about, if we were to start from scratch, and solve the problem the way we wanted it to be solved, what would that system look like. And it was a few things. One was an interface that worked for BI. I don't want to date myself, but my experience in the software space started with OLAP. And I can tell you OLAP isn't dead. When you go and talk to an enterprise, a Fortune 1000 enterprise, and you talk about OLAP, that's how they think. They think in terms of measures and dimensions and hierarchies. So one important thing for us was to project an OLAP interface on top of data that's Hadoop native. It's Hive tables, Parquet, ORC, you kind of talk about all of the mess that may sit underneath the covers. So one thing was projecting that interface, the other thing was delivering performance. So we've invested a lot in using the Hadoop cluster natively to deliver performing queries. We do this by creating aggregate tables and summary tables and being smart about how we route queries. But we've done it in a way that makes a Hadoop admin very happy. You don't have to buy a bunch of AtScale servers in addition to your Hadoop cluster. We scale the way the Hadoop cluster scales. So we don't require separate technology. So we fit really nicely into that Hadoop ecosystem. >> So how do you make, making the Hadoop admin happy is a good thing. How do you make the business user happy, who needs now, as we were here yesterday, to kind of merge more with the data science folks to be able to understand or even have the chance to articulate, "These are the business outcomes we want to look for and we want to see." How do you guys, maybe, under the hood, if you will, AtScale, make the business guys and gals happy? >> I'll share my opinion and then Prashanthi can comment on her experience but, as I've mentioned before, the business users want an interface that's simple to use. And so that's one thing we do, is, we give them the ability to just look at measures and dimensions. If I'm a business user, I grew up using Excel to do my analysis. The thing I like most as an analyst is a big fat wide table. And so that's what we do, we make an underlying Hadoop cluster and what could be tens or hundreds of tables look like a single big fat wide table for a data analyst. You talk to a data scientist, you talk to a business analyst, that's the way they want to view the world. So that's one thing we do.
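To make the aggregate-table idea Josh describes above concrete, here is a hedged, generic PySpark sketch; it is not AtScale's implementation, and the table names, columns, and values are invented. A summary table is pre-computed at a coarser grain, and a BI-style query is then answered from that small table instead of rescanning the raw fact table, which is the routing decision an acceleration layer would make automatically.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aggregate-routing-sketch").getOrCreate()

# A toy "fact" table; in practice this would be billions of rows in Hive/Parquet/ORC.
fact = spark.createDataFrame(
    [("2017-09-01", "widget", 3, 30.0),
     ("2017-09-01", "gadget", 1, 15.0),
     ("2017-09-02", "widget", 5, 50.0)],
    ["order_date", "product", "units", "revenue"],
)

# Pre-compute a summary table at the day/product grain, the way an acceleration
# layer would materialize an aggregate back into the cluster.
daily_product = (fact.groupBy("order_date", "product")
                     .agg(F.sum("units").alias("units"),
                          F.sum("revenue").alias("revenue")))
daily_product.createOrReplaceTempView("agg_daily_product")

# A dashboard-style query at that grain is answered from the small aggregate table.
spark.sql("""
    SELECT order_date, SUM(revenue) AS revenue
    FROM agg_daily_product
    GROUP BY order_date
    ORDER BY order_date
""").show()
```

Here the choice of which table answers the query is made by hand to keep the sketch short; the product being discussed decides that routing automatically based on the cube design.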
And then, we give them response times that are fast. We give them interactivity, so that you could really quickly start to get a sense of the shape of the data. >> And allowing them to get that time to value. >> Yes. >> I can imagine. >> Just a follow-up on that. When you have to prepare the aggregates, essentially like the cubes, instead of the old BI tools running on a data mart, what is the additional latency that's required from data coming fresh into the data lake and then transforming it into something that's consumption ready for the business user? >> Yeah, I think I can take that. So again, if you look at the last 10 years, in the initial period, certainly at Yahoo, we just threw engineering resources at that problem, right. So we had teams dedicated to building these aggregates. But the whole premise of Hadoop was the ability to do unstructured optimizations. And by having a team find out the new data coming in and then integrating that into your pipeline, so we were adding a lot of latency. And so we needed to figure out how we can do this in a more seamless way, in a more real-time way. And get the, you know, the real premise of Hadoop. Get it at the hands of our business users. I mean, I think that's where AtScale is doing a lot of the good work in terms of dynamically being able to create aggregates based on the design that you put in the cube. So we are starting to work with them on our implementation. We're looking forward to the results. >> Tell us a little bit more about what you're looking to achieve. So GoDaddy is a customer of AtScale. Tell us a little bit more about that. What are you looking to build together, and kind of, where are you in your journey right now? >> Yeah, so the main goal for us is to move beyond predefined models, dashboards, and reports. So we want to be more agile with our schema changes. Time to market is one. And performance, right. Ability to put BI tools directly on top of Hadoop, is one. And also to push as much of the semantics as possible down into the Hadoop layer. So those are the things that we're looking to do. >> So that sounds like a classic business intelligence component, but sort of rethought for a big data era. >> I love that quote, and I feel it. >> Prashanthi: Yes. >> Josh: Yes. (laughing) >> That's exactly what we're trying to do. >> But that's also, some of the things you mentioned are non-trivial. You want to have this, time goes in to the pre-processing of data so that it's consumable, but you also wanted it to be dynamic, which is sort of a trade-off, which means, you know, that takes time. So is that a sort of a set of requirements, a wishlist for AtScale, or is that something that you're building on your own? >> I think there's a lot happening in that space. They are one of the first people to come out with their product, which is solving a real problem that we tried to solve for a long time. And I think as we start using them more and more, we'll surely be pushing them to bring in more features. I think the algorithm that they have to dynamically generate aggregates is something that we're giving quite a lot of feedback to them on. >> Our last guest from Pentaho was talking about, there was, in her keynote today, the quote from I think McKinsey report that said, "40% of machine learning data is either not fully "exploited or not used at all." So, tell us, kind of, where is big daddy regarding machine learning? What are you seeing? 
What are you seeing at AtScale and how are you guys going to work together to maybe venture into that frontier? >> Yeah, I mean, I think one of the key requirements we're placing on our data scientists is, not only do you have to be very good at your data science job, you have to be a very good programmer too to make use of the big data technologies. And we're seeing some interesting developments like very workload-specific engines coming into the market now for search, for graph, for machine learning, as well. Which are supposed to put the tools right into the hands of data scientists. I personally haven't worked with them to be able to comment. But I do think that the next realm of big data is these workload-specific engines, coming on top of Hadoop, and realizing more of the insights for the end users. >> Curious, can you elaborate a little more on those workload-specific engines, that sounds rather intriguing. >> Well, I think for interacting with Hadoop on a real-time basis, we see search-based engines like Elasticsearch, Solr, and there is also Druid. At Yahoo, we were quite a big shop of Druid actually. And we were using it as an interactive query layer directly with our applications, BI applications. These are our JavaScript-based BI applications, on Hadoop. So I think there are quite a few means to realize insights from Hadoop now. And that's the space where I see workload-specific engines coming in. >> And you mentioned earlier before we started that you were using Mahout, presumably for machine learning. And I guess I thought the center of gravity for that type of analytics has moved to Spark, and you haven't mentioned Spark yet. >> We are not using Mahout though. I mentioned it as something that's in that space. But yeah, I mean, Spark is pretty interesting. Spark SQL, doing ETL with Spark, as well as using Spark SQL for queries is something that looks very, very promising lately. >> Quick question for you, from a business perspective, so you're the Head of Data Engineering at GoDaddy. How do you interact with your business users? The C-suite, for example, where data science, machine learning, they understand, we have to have, they're embracing Hadoop more and more. They really need to embrace big data and leverage Hadoop as an enabler. What's the conversation like, or maybe even the influence of the GoDaddy business C-suite on engineering? How do you guys work collaboratively? >> So we do have very regular stakeholder meetings. And these are business stakeholders. So we have representatives from our marketing teams, finance, product teams, and data science team. We consider data science as one of our customers. We take requirements from them. We give them a peek into the work we're doing. We also let them be part of our agile team so that when we have something released, they're the first ones looking at it and testing it. So they're very much part of the process. I don't think we can afford to just sit back and work on this monolithic data warehouse and at the end of the day say, "Hey, here is what we have" and ask them to go get the insights from it. So it's a very agile process, and they're very much part of it. >> One last question for you, sorry George, is, you guys mentioned you are sort of early in your partnership, unless I misunderstood. What has AtScale helped GoDaddy achieve so far and what are your expectations, say the next six months? >> We want the world. (laughing) >> Lisa: Just that.
>> Yeah, but the premise is, I mean, so Josh and I, we were part of the same team at Yahoo, where we faced problems that AtScale is trying to solve. So the premise of being able to solve those problems, which is, like their name, basically delivering data at scale, that's the premise that I'm very much looking forward to from them. >> Well, excellent. Well, we want to thank you both for joining us on theCUBE. We wish you the best of luck in attaining the world. (all laughing) >> Josh: There we go, thank you. >> Excellent, guys. Josh Klahr, thank you so much. >> My pleasure. Prashanthi, thank you for being on theCUBE for the first time. >> No problem. >> You've been watching theCUBE live at the day two of the DataWorks Summit. For my cohost George Gilbert, I am Lisa Martin. Stick around guys, we'll be right back. (jingle)
SUMMARY :
Brought to you by Hortonworks. George and I are excited to welcome our next guests. And we have Prashanthi Paty, Talk to us about what you both visualized and experienced And the other thing that we did was, and then we would scale it back, and the lack of proper file formats like Parquet? So we were dealing with Pig versions of like, the idea is that you could do full resolution And the other thing was the interface. How have you succeeded there, and solve the problem the way we wanted it to be solved, So how do you make, And so that's one thing we do, is, that's consumption ready for the business user? based on the design that you put in the cube. and kind of, where are you in your journey right now? So we want to be more agile with our schema changes. So that sounds like a classic business intelligence Josh: Yes. of data so that it's consumable, but you also wanted And I think as we start using them more and more, What are you seeing at AtScale and how are you guys and realizing more of the insights for the end users. Curious, can you elaborate a little more And we were using it as an interactive query layer and you haven't mentioned Spark yet. machine learning, they understand, we have to have, and at the end of the day say, "Hey, here is what we have" you guys mentioned you are sort of early We want the world. So the premise of being able to solve those problems, Well, we want to thank you both for joining us on theCUBE. Josh Klahr, thank you so much. for the first time. of the DataWorks Summit.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Josh | PERSON | 0.99+ |
George | PERSON | 0.99+ |
Lisa Martin | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Josh Klahr | PERSON | 0.99+ |
Prashanthi Paty | PERSON | 0.99+ |
Prashanthi | PERSON | 0.99+ |
Lisa | PERSON | 0.99+ |
Yahoo | ORGANIZATION | 0.99+ |
Kevin Bacon | PERSON | 0.99+ |
San Jose | LOCATION | 0.99+ |
Excel | TITLE | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
GoDaddy | ORGANIZATION | 0.99+ |
40% | QUANTITY | 0.99+ |
yesterday | DATE | 0.99+ |
AtScale | ORGANIZATION | 0.99+ |
tens | QUANTITY | 0.99+ |
Spark | TITLE | 0.99+ |
Druid | TITLE | 0.99+ |
First job | QUANTITY | 0.99+ |
Hadoop | TITLE | 0.99+ |
two | QUANTITY | 0.99+ |
Spark SQL | TITLE | 0.99+ |
today | DATE | 0.99+ |
two degrees | QUANTITY | 0.99+ |
both | QUANTITY | 0.98+ |
one | QUANTITY | 0.98+ |
DataWorks Summit | EVENT | 0.98+ |
two things | QUANTITY | 0.98+ |
Elasticsearch | TITLE | 0.98+ |
first time | QUANTITY | 0.98+ |
DataWorks Summit 2017 | EVENT | 0.97+ |
first teams | QUANTITY | 0.96+ |
Solr | TITLE | 0.96+ |
Mahout | TITLE | 0.95+ |
hundreds of tables | QUANTITY | 0.95+ |
two main jobs | QUANTITY | 0.94+ |
One last question | QUANTITY | 0.94+ |
billions and | QUANTITY | 0.94+ |
McKinsey | ORGANIZATION | 0.94+ |
Day two | QUANTITY | 0.94+ |
One | QUANTITY | 0.94+ |
Parquet | TITLE | 0.94+ |
Tableau | TITLE | 0.93+ |
John Cavanaugh, HP - #SparkSummit - #theCUBE
>> Announcer: Live from San Francisco, it's theCube, covering Spark Summit 2017, brought to you by Databricks. >> Welcome back to theCube at Spark Summit 2017. I don't know about you, George, I'm having a great time learning from all of our attendees. >> We've been absorbing now for almost two days. >> Yeah, well, and we're about to absorb a little bit more here, too, because the next guest, I looking forward to, I saw his name on the schedule, all right, that's the guy who talks about herding cats, it's John Cavanaugh, Master Architect from HP. John, welcome to the show. >> Great, thanks for being here. >> Well, I did see, I don't know if it's about cats in the Internet, but either cats or self-driving cars, one of the two in analogies. But talk to us about your session. Why did you call it Herding Cats, and is that related to maybe the organization at HP? >> Yeah, there's a lot of organizational dynamics as part of our migration at Spark. HP is a very distributed organization, and it has had a lot of distributed autonomy, so, you know, trying to get centralized activity is often a little challenging. You guys have often heard, you know, I am from the government, I'm here to help. That's often the kind of shields-up response you will get from folks, so we got a lot of dynamics in terms of trying to bring these distributed organizations on board to a new common platform, and a allay many of the fears that they had with making any kind of a change. >> So, are you centered at a specific division? >> So, yes, I'm the print platforms and future technology group. You know, there's two large business segments with HP. There's our personal systems group that produces everything from phones to business PCs to high-end gaming. But I'm in the printing group, and while many people are very familiar with your standard desktop printer, you know, the printers we sell really vary from a very small product we call Sprocket, it fits in your hand, battery-operated, to literally a web press that's bigger than your house and prints at hundreds of feet per minute. So, it's a very wide product line, and it has a lot of data collection. >> David: Do you have 3D printing as well? >> We do have 3D printing as well. That's an emergent area for us. I'm not super familiar with that. I'm mostly on the 2D side, but that's a very exciting space as well. >> So tell me about what kind of projects that you're working on that do require that kind of cross-team or cross-departmental cooperation. >> So, you know, in my talk, I talked about the Wild West Era of Big Data, and that was prior to 2015, and we had a lot of groups that were standing up all kinds of different big data infrastructures. And part of this stems from the fact that we were part of HP at the time, and we could buy servers and racks of servers at cost. Storage was cheap, all these things, so they sprouted up everywhere. And, around 2015, everybody started realizing, oh my God, this is completely fragmented. How do we pull things back together? And that's when a lot of groups started trying to develop platformish types of activities, and that's where we knew we needed to go, but there was even some disagreement from different groups, how do we move forward. So, there's been a lot of good work within HP in terms of creating a virtual community, and Spark really kind of caught on pretty quickly. Many people were really tired of kind of Hadoop. There were a lot of very opinionated models in Hadoop, where Spark opens up a lot more into the data science community. 
So, that went really well, and we made a big push into AWS for much of our cloud activities, and we really ended up then pretty quickly with Databricks as an enterprise partner for us. >> And so, George, you've done a lot of research. I'm sure you talked to enterprise companies along the way. Is this a common issue with big enterprises? >> Well, for most big data projects they've started, the ones we hear a lot about is there's a mandate from the CIO, we need a big data strategy, and so some of those, in the past, stand up a five or 10-node Hadoop cluster and run some sort of pilot and say, this is our strategy. But it sounds like you herded a lot of cats... >> We had dozens of those small Hadoop clusters all around the company. (laughter) >> So, how did you go about converting that energy, that excess energy towards something more harmonized around Databricks? >> Well, a lot of people started recognizing we had a problem, and this really wasn't going to scale, and we really needed to come up with a broader way to share things across the organization. So, the timing was really right, and a lot of people were beginning to understand that. And, you know, we said for us, there were probably about five different kind of key decisions we ended up making. And part of the whole strategy was to empower the businesses. As I have mentioned, we are a very distributed organization, so, you can't really dictate to the businesses. The businesses really need to own their success. And one of the decisions that was made, it might be kind of controversial for many CIOs, is that we've made a big push on cloud-hosted and business-owned, not IT-owned. And one of the real big reasons for that is we were no longer viewing data and big data as kind of a business-intelligence activity or a standardized reporting activity. We really knew that, to be successful moving forward, it needed to be built into our products and services, and those products and services are managed by the businesses. So, it can't be something that would be tossed off to an IT organization. >> So that the IT organization, then, evolved into being more of an innovative entity versus a reactive or supportive entity for all those different distributed groups. >> Well, in our regard, we've ended up with AWS as part of our activity, and, really, much of our big data activities are driven by the businesses. The connections we have with IT are more related to CRM and product data master sheets and selling in channels and all that information. >> But if you take a bunch of business-led projects and then try and centralize some aspect of them, wouldn't IT typically become the sort of shared infrastructure architecture advisor for that, and then the businesses now have a harmonized platform on which they can build shared data sets? >> Actually, in our case, that's what we did. We had a lot of our businesses that already had significant services hosted in AWS. And those were very much part of the high-data generators. So, it became a very natural evolution to continue with some of our AWS relationships and continue on to Databricks. So, as an organization today, we have three kind of main buckets for our Databricks, but, you know, any business, they can get their accounts. We try and encourage everything to get into a data lake, and that's S3, in Parquet format, which was one of the decisions that was adopted. And then, from there, people can begin to move. You know, you can get notebooks, you can share notebooks, you can look at those things.
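A minimal sketch of the data-lake pattern just described, assuming hypothetical bucket, path, and field names rather than HP's actual schema: telemetry lands in S3 once, is standardized to Parquet, and any group's notebook can read it directly.

```python
# Hypothetical buckets and fields; illustrates the S3 + Parquet data-lake pattern.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("telemetry-lake-sketch").getOrCreate()

# Raw telemetry (e.g., JSON events from devices) lands in one place...
raw = spark.read.json("s3a://example-telemetry/raw/2017/06/")

# ...and is standardized into Parquet, partitioned so later queries stay cheap.
(raw.select("device_id", "event_type", "event_time", "ink_level", "page_count")
    .write.mode("append")
    .partitionBy("event_type")
    .parquet("s3a://example-data-lake/telemetry/"))

# Any notebook in any business group can then pick it up directly:
telemetry = spark.read.parquet("s3a://example-data-lake/telemetry/")
telemetry.filter("event_type = 'supply_status'").groupBy("device_id").count().show()
```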
You know, the beauty of Databricks and AWS is instant on. If I want to play around with something with a half a dozen nodes, it's great. If I need a thousand for a workload, boom, I've got it! I know, kind of others, then, with this cost and the value returned, there's really no need for permissions or coordination with other entities, and that's kind of what we wanted the businesses to have that autonomy to drive their business success. >> But, does there not to be some central value added in the way of, say, data curation through a catalog or something like that? >> Yes, so, this is not necessarily a model where all the businesses are doing all kinds of crazy things. One of the things that we shepherded by one of our CTOs and the other functions, we ended up creating a virtual community within HP. This kind of started off with a lot of "tribal elders" or "tribal leaders." With this virtual community, today we get together every two weeks, and we have presentations and discussions on all things from data science into machine learning, and that's where a lot of this activity around how do we get better at sharing. And this is fostered, kind of splinters off for additional activity. So we have one on data telemetry within our organization. We're trying to standardize more data formats and schemas for those so we can have more broader sharing. So, these things have been occurring more organically as part of a developer enablement kind of moving up rather than more of kind of dictates moving down. >> That's interesting. Potentially, really important, when you say, you're trying to standardize some of the telemetry, what are you instrumenting. Is it just all the infrastructure or is it some of the products that HP makes? >> It's definitely the products and the software. You know, like I said, we manage a huge spectrum of print products, and my apologies if I'm focusing on it, but that is what I know the best. You know, we've actually been doing telemetry and analysis since the late 90s. You know, we wanted to understand use of supplies and usage so we could do our own forecasting, and that's really, really grown over the years. You know, now, we have parts of our services organization management services, where they're offering big data analytics as part of the package, and we provide information about predictive failure of parts. And that's going to be really valuable for some of our business partners that allows them. We have all kinds fancy algorithms that we work on. The customers have specific routes that they go for servicing, and we may be able to tell them, hey, in a certain time period, we think these devices in your field so you can coordinate your route to hit those on an efficient route rather than having to make a single truck roll for one repair, and do that before a customer experiences a problem. So, it's been kind of a great example of different ways that big data can impact the business. >> You know, I think Ali mentioned in the keynote this morning about the example of a customer getting a notification that their ink's going to run out, and the chance that you get to touch that customer and get them to respond and buy, you could make millions of dollar difference, right? Let's talk about some of the business outcomes and the impact that some of your workers have done, and what it means, really, to the business. >> Right now, we're trying to migrate a lot of legacy stuff, and you know, that's kind of boring. 
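As a rough illustration of the predictive-service idea above, the following hypothetical PySpark snippet groups devices with predicted near-term failures by the route that services them; the table names, fields, and thresholds are made up, and the prediction itself is assumed to come from a separate model.

```python
# Hypothetical schema; shows the shape of a route-planning query, not HP's models.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("route-planning-sketch").getOrCreate()

predictions = spark.read.parquet("s3a://example-data-lake/failure_predictions/")
routes = spark.read.parquet("s3a://example-data-lake/service_routes/")

# Devices likely to fail soon, grouped by the route that services them.
at_risk = (predictions
    .filter((F.col("failure_probability") > 0.8) &
            (F.col("predicted_window_days") <= 30))
    .join(routes, "device_id")
    .groupBy("route_id")
    .agg(F.collect_list("device_id").alias("devices_to_visit"),
         F.count("*").alias("device_count")))

at_risk.orderBy(F.desc("device_count")).show(truncate=False)
```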
(laughs) It's just a lot of work, but there are things that need to happen. But there's really the power of the big data platform has been really great with Databricks. I know, John Landry, one of our CTOs, he's in the personal systems group. He had a great example on some problems they had with batteries and laptops, and, you know, they have a whole bunch of analytics. They've been monitoring batteries, and they found a collection of batteries that experienced very early failure rates. I happen to be able to narrow it down to specific lots from a specific supplier, and they were able to reach out to customers to get those batteries replaced before they died. >> So, a mini-recall instead of a massive PR failure. (laughs) >> You know, it was really focused on, you know, customers didn't even know they were going to have a problem with these batteries, that they were going to die early. You know, you got to them ahead of time, told them we knew this was going to be a problem and try to help them. I mean, what a great experience for a customer. (laughs) That's just great. >> So, once you had this telemetry, and it sounds like a bunch of shared repositories, not one intergalactic one. What were some of the other use cases like, you know, like the battery predictive failure type scenarios. >> So, you know, we have some very large gaps, or not gaps, with different categories. We have clearly consumer products. You know, you sell millions and millions of those, and we have little bit of telemetry with those. I think we want to understand failures and ink levels and some of these other things. But, on our commercial web presses, these very large devices, these are very sensitive. These things are down, they have a big problem. So, these things are generating all kinds of data. All right, we have systems on a premise with customers that are alerting them to potential failures, and there's more and more activity going on there to understand predictive failure and predictive kind of tolerance slippages. I'm not super familiar with that business, but I know some guys that they've started introducing more sensors into products, specifically so they can get more data, to understand things. You know, slight variations in tensioning and paper, you know, these things that are running hundreds of feet per minute can have a large impact. So, I think that's really where we see more and more of the value coming from is being able to return that value back to the customer, not just help us make better decisions, but to get that back to the customer. You know, we're talking about expanding more customer-facing analytics in these cases, or we'll expose to customers some of the raw data, and they can build their own dashboards. Some of these industries have traditionally been very analog, so this move to digital web process and this mountain of data is a little new for them, but HP can bring a lot to the table in terms of our experience in computing and big data to help them with their businesses. >> All right, great stuff. And we just got a minute to go before we're done. I have two questions for you, the first is an easy yes/no question. >> John: Okay. >> Is Purdue going to repeat as Big 10 champ in basketball? >> Oh, you know, I don't know. (laughs) I hope so! >> We both went to Purdue. >> I'm more focused on the Warriors winning. (laughter) >> All right, go Warriors! And, the real question is, what surprised you the most? This is your first Spark Summit. What surprised you the most about the event? 
>> So, you know, you see a lot of Internet-born companies, and it's amazing how many people have just gone fully native with Spark all over the place, and it's a beautiful thing to see. You know, in larger enterprises, that transition doesn't happen like that. I'm kind of jealous. (laughter) We have a lot more things slug through, but the excitement here and all the things that people are working on, you know, you can only see so many tracks. I'm going to have to spend two days when I get back, just watching the videos on all of the tracks I couldn't attend. >> All right, Internet-born companies versus the big enterprise. Good luck herding those cats, and thank you for sharing your story with us today and talking a little bit about the culture there at HP. >> John: Thank you very much. >> And thank you all for watching this segment of theCube. Stay with us, we're still covering Spark Summit 2017. This is Day Two, and we're not done yet. We'll see you in a few minutes. (theCube jingle)
SUMMARY :
covering Spark Summit 2017, brought to you by Databricks. Welcome back to theCube at Spark Summit 2017. all right, that's the guy who talks about herding cats, and is that related to maybe the organization at HP? and a allay many of the fears that they had and it has a lot of data collection. I'm mostly on the 2D side, that you're working on and we had a lot of groups that were standing up I'm sure you talked to enterprise companies along the way. the ones we hear a lot about is all around the company. and we really needed to come up with So that the IT organization, then, evolved and selling in channels and all that information. and Parquet formats, one of the decisions that was adapted. One of the things that we shepherded or is it some of the products that HP makes? and that's really, really grown over the years. and the chance that you get to touch that customer a lot of legacy stuff, and you know, that's kind of boring. So, a mini-recall instead of a massive PR failure. You know, it was really focused on, you know, What were some of the other use cases like, you know, and we have little bit of telemetry with those. And we just got a minute to go before we're done. Oh, you know, I don't know. I'm more focused on the Warriors winning. And, the real question is, what surprised you the most? and it's a beautiful thing to see. and thank you for sharing your story with us today And thank you all for watching this segment of theCube.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
George | PERSON | 0.99+ |
John | PERSON | 0.99+ |
John Cavanaugh | PERSON | 0.99+ |
David | PERSON | 0.99+ |
John Landry | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
HP | ORGANIZATION | 0.99+ |
San Francisco | LOCATION | 0.99+ |
2015 | DATE | 0.99+ |
millions | QUANTITY | 0.99+ |
two questions | QUANTITY | 0.99+ |
two days | QUANTITY | 0.99+ |
two | QUANTITY | 0.99+ |
Ali | PERSON | 0.99+ |
one | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
Databricks | ORGANIZATION | 0.99+ |
half a dozen | QUANTITY | 0.99+ |
Hadoop | TITLE | 0.99+ |
three | QUANTITY | 0.99+ |
late 90s | DATE | 0.98+ |
Warriors | ORGANIZATION | 0.98+ |
Spark Summit 2017 | EVENT | 0.98+ |
hundreds of feet | QUANTITY | 0.98+ |
dozens | QUANTITY | 0.98+ |
One | QUANTITY | 0.98+ |
both | QUANTITY | 0.96+ |
Spark Summit | EVENT | 0.96+ |
today | DATE | 0.96+ |
hundreds of feet per minute | QUANTITY | 0.94+ |
Spark | ORGANIZATION | 0.93+ |
single truck | QUANTITY | 0.93+ |
a thousand | QUANTITY | 0.92+ |
one repair | QUANTITY | 0.92+ |
five | QUANTITY | 0.91+ |
this morning | DATE | 0.89+ |
Purdue | ORGANIZATION | 0.87+ |
Sprocket | ORGANIZATION | 0.87+ |
2D | QUANTITY | 0.84+ |
Day Two | QUANTITY | 0.83+ |
every two weeks | QUANTITY | 0.81+ |
dollar | QUANTITY | 0.81+ |
three kind | QUANTITY | 0.81+ |
two large business segments | QUANTITY | 0.7+ |
Spark | TITLE | 0.69+ |
five different | QUANTITY | 0.64+ |
Herding Cats | ORGANIZATION | 0.64+ |
about | QUANTITY | 0.6+ |
10-node | QUANTITY | 0.57+ |
so many | QUANTITY | 0.51+ |
Purdue | EVENT | 0.48+ |
theCube | ORGANIZATION | 0.47+ |
Parquet | TITLE | 0.42+ |
eally | ORGANIZATION | 0.3+ |
10 | QUANTITY | 0.28+ |
Yaron Haviv | BigData SV 2017
>> Announcer: Live from San Jose, California, it's the CUBE, covering Big Data Silicon Valley 2017. (upbeat synthesizer music) >> Live with the CUBE coverage of Big Data Silicon Valley or Big Data SV, #BigDataSV in conjunction with Strata + Hadoop. I'm John Furrier with the CUBE and my co-host George Gilbert, analyst at Wikibon. I'm excited to have our next guest, Yaron Haviv, who's the founder and CTO of iguazio, just wrote a post up on SiliconANGLE, check it out. Welcome to the CUBE. >> Thanks, John. >> Great to see you. You're in a guest blog this week on SiliconANGLE, and always great on Twitter, cause Dave Alante always liked to bring you into the contentious conversations. >> Yaron: I like the controversial ones, yes. (laughter) >> And you add a lot of good color on that. So let's just get right into it. So your company's doing some really innovative things. We were just talking before we came on camera here, about some of the amazing performance improvements you guys have on many different levels. But first take a step back, and let's talk about what this continuous analytics platform is, because it's unique, it's different, and it's got impact. Take a minute to explain. >> Sure, so first a few words on iguazio. We're developing a data platform which is unified, so basically it can ingest data through many different APIs, and it's more like a cloud service. It is for on-prem and edge locations and co-location, but it's managed more like a cloud platform so very similar experience to Amazon. >> John: It's software? >> It's software. We do integrate a lot with hardware in order to achieve our performance, which is really about 10 to 100 times faster than what exists today. We've talked to a lot of customers and what we really want to focus with customers in solving business problems, Because I think a lot of the Hadoop camp started with more solving IT problems. So IT is going kicking tires, and eventually failing based on your statistics and Gardner statistics. So what we really wanted to solve is big business problems. We figured out that this notion of pipeline architecture, where you ingest data, and then curate it, and fix it, et cetera, which was very good for the early days of Hadoop, if you think about how Hadoop started, was page ranking from Google. There was no time sensitivity. You could take days to calculate it and recalibrate your search engine. Based on new research, everyone is now looking for real time insights. So there is sensory data from (mumbles), there's stock data from exchanges, there is fraud data from banks, and you need to act very quickly. So this notion of and I can give you examples from customers, this notion of taking data, creating Parquet file and log files, and storing them in S3 and then taking Redshift and analyzing them, and then maybe a few hours later having an insight, this is not going to work. And what you need to fix is, you have to put some structure into the data. Because if you need to update a single record, you cannot just create a huge file of 10 gigabyte and then analyze it. So what we did is, basically, a mechanism where you ingest data. As you ingest the data, you can run multiple different processes on the same thing. And you can also serve the data immediately, okay? 
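As a generic stand-in for the mechanism Yaron describes, the sketch below uses plain boto3, Spark, and S3 as placeholders (not iguazio's actual API) to show the shape of the pattern: one ingest path, several consumers over the same records, and immediate serving with no intermediate batch pipeline.

```python
# Generic AWS/Spark APIs used as placeholders; not iguazio's product interface.
import boto3
from pyspark.sql import SparkSession

s3 = boto3.client("s3")

# 1. Ingest: a producer writes an event/object once.
s3.put_object(Bucket="example-events", Key="cameras/cam-12/0001.json",
              Body=b'{"camera_id": "cam-12", "ts": "2017-03-14T10:00:00Z"}')

# 2. Analytics: Spark reads the very same records as a table, with no copy step.
spark = SparkSession.builder.appName("continuous-analytics-sketch").getOrCreate()
events = spark.read.json("s3a://example-events/cameras/")
events.createOrReplaceTempView("camera_events")
spark.sql("SELECT camera_id, count(*) FROM camera_events GROUP BY camera_id").show()

# 3. Serving: a dashboard could fetch individual records straight back out.
record = s3.get_object(Bucket="example-events", Key="cameras/cam-12/0001.json")
print(record["Body"].read())
```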
And two examples that we demonstrate here in the show, one is video surveillance, very nice movie-style example, where you, basically, ingest pictures through the S3 API, the object API, you analyze the picture to detect faces, to detect scenery, to extract geolocation from pictures and all that, all those through different processes. TensorFlow doing one, serverless functions that we have, do other simpler tasks. And at the same time, you can have dashboards that just show everything. And you can have Spark, that basically does queries of where was this guy last seen? Or who was he with, you know, or think about the Boston Bomber example. You could just do it in real time. Because you don't need this notion of pipeline. And this solves very hard business problems for some of the customers we work with. >> So that's the key innovation, there's no pipelining. And what's the secret sauce? >> So first, our system does about a couple of million transactions per second. And we are a multi-modal database. So, basically, you can ingest data as a stream, and exactly the same data could be read by Spark as a table. So you could, basically, issue a query on the same data. Give me everything that has a certain pattern or something, and it could also be served immediately through RESTful APIs to a dashboard running AngularJS or something like that. So that's the secret sauce, is by having this integration, and this unique data model, it allows all those things to work together. There are other aspects, like we have transactional semantics. One of the challenges is how do you make sure that a bunch of processes don't collide when they update the same data. So first you need very fine granularity, 'cause each one may update a different field. Like this example that I gave with GeoData, the serverless function that does the GeoData extraction only updates the GeoData fields within the records. And maybe TensorFlow updates information about the image in a different location in the record or, potentially, a different record. So you have to have that, along with transaction safety, along with security. We have very tight security at the field level, identity level. So that's re-thinking the entire architecture. And I think what many of the companies you'll see at the show, they'll say, okay, Hadoop is given, let's build some sort of convenience tools around it, let's do some scripting, let's do automation. But the underlying thing, I won't use dirty words, but it is not well-equipped for the new challenges of real time. We basically restructured everything, we took the notions of cloud-native architectures, we took the notions of Flash and the latest Flash technologies, a lot of parallelism on CPUs. We didn't take anything for granted on the underlying architecture. >> So when you founded the company, take a personal story here. What was the itch you were scratching, why did you get into this? Obviously, you have a huge tech advantage, which we'll double-down on with the research piece, and George will have some questions. What got you going with the company? You got a unique approach, people would love to do away with the pipeline, that sounds great. And the performance, you said about 100x. So how did you get here? (laughs) Tell the story. >> So if you know my background, I ran all the data center activities in Mellanox, and you know Mellanox, I know Kevin was here. And my role was to take Mellanox technology, which is 100 gig networking and silicon, and fit it into the different applications.
So I worked with SAP HANA, I worked with Teradata, I worked on Oracle Exadata, I worked with all the cloud service providers on building their own object storage and NoSQL and other solutions. I also owned all the open source activities around Hadoop and Ceph and all those projects, and my role was to fix many of those. If a customer said, I don't need 100 gig, it's too fast for me, what do I do? My role was to convince him that yes, I can open up all the bottlenecks all the way up your stack so you can leverage those new technologies. And for that we basically solved inefficiencies in those stacks. >> So you had a good purview of the marketplace. >> Yaron: Yes. >> You had open source on one hand, and then all the-- >> All the storage players, >> vendors, network. >> all the database players and all the cloud service providers were my customers. So you're at a very unique point where you see the trajectory of cloud. Doing things totally different, and sometimes I see the trajectory of enterprise storage, SAN, NAS, you know, all Flash, all that, legacy technologies, where cloud providers are all about object, key value, NoSQL. And you're trying to convince those guys that maybe they were going the wrong way. But it's pretty hard. >> Are they going the wrong way? >> I think they are going the wrong way. Everyone, for example, is running to do NVMe over Fabric now, that's the new fashion. Okay, I did the first implementation of NVMe over Fabric, in my team at Mellanox. And I really loved it, at that time, but databases cannot run on top of storage area networks. Because there are serialization problems. Okay, if you use a storage area network, that means that every node in the cluster has to go and serialize an operation against the shared media. And that's not how Google and Amazon work.
>> Yeah, so I wrote a bunch of posts that are not published yet about the Edge. But my analysis, along with your analysis and Pierre Levin's analysis, is that cloud has to start distributing more. Because if you're looking at the trends, five gig, 5G Wi-Fi in wireless networking is going to be gigabit traffic. Gigabit to the homes, they're going to buy it from Google, 70 bucks a month. It's going to push a lot more bandwidth to the Edge. At the same time, cloud providers, in order to lower costs and deal with energy problems, are going to rural areas. The traditional way we solved cloud problems was to put CDNs, so every time you download a picture or video, you go to a CDN. When you go to Netflix, you don't really go to Amazon, you go to a Netflix pop, one of 250 locations. The new workloads are different because they're no longer pictures that need to be cached. First, there is a lot of data going up. Sensory data, uploaded files, et cetera. Data is becoming a lot more structured. Sensor data is structured. All this car information will be structured. And you want to (mumbles) digest or summarize the data. So you need technologies like machine learning and AI and all those things. You need something which is like CDNs. Just a mini version of cloud that sits somewhere in between the Edge and the cloud. And this is our approach. And now, because we can shrink the mini cloud, the mini Amazon, into a way more dense approach, this is a play that we're going to take. We have a very good partnership with Equinix, which has 170-something locations, with very good relations. >> So you're, essentially, going to disrupt the CDN. It's something that I've been writing about and tweeting about. CDNs were based on the old Yahoo days. Caching images, you mentioned, give me 1999 back, please. That's old school by today's standards. So it's a whole new architecture because of how things are stored. >> You have to be a lot more distributed. >> What is the architecture? >> In our innovation, we have two layers of innovation. One is on the lower layers of, we, actually, have three main innovations. One is on the lower layers of what we discussed. The other one is the security layer, where we classify everything. Layer seven at 100 gig traffic rates. And the third one is all this notion of distributed systems. We can, actually, run multiple systems in multiple locations and manage them as one logical entity through high level semantics, high level policies. >> Okay, so when we take the CUBE global, we're going to have you guys on every pop. This is a legit question. >> No, it's going to take time for us. We're not going to do everything in one day and we're starting with the local problems. >> Yeah, but this is digital transformation. Stay with me for a second. Stay with this scenario. So video like Netflix is, pretty much, one dimension, it's video. They use CDNs now, but when you start thinking in different content types. So, I'm going to have a video with, maybe, just CGI overlayed or social graph data coming in from tweets at the same time with Instagram pictures. I might be accessing multiple data everywhere to watch a movie or something. That would require thinking beyond a CDN.
So you have to move to those two elements of moving more stuff into the Edge and running into continuous analytics versus a batch on pipeline. >> So you think, based on that scenario I just said, that there's going to be an opportunity for somebody to take over the media landscape for sure? >> Yeah, I think if you're also looking at the statistics. I seen a nice article. I told George about it. That analyzing the Intel cheap distribution. What you see is that there is a 30% growth on Intel's cheap Intel Cloud which is faster than what most analysts anticipate in terms of cloud growth. That means, actually, that cloud is going to cannibalize Enterprise faster than what most think. Enterprise is shrinking about 7%. There is another place which is growing. It's Telcos. It's not growing like cloud but part of it is because of this move towards the Edge and the move of Telcos buying white boxes. >> And 5G and access over the top too. >> Yeah but that's server chips. >> Okay. >> There's going to be more and more computation in the different Telco locations. >> John: Oh you're talking about computer, okay. >> This is an opportunity that we can capitalize on if we run fast enough. >> It sounds as though because you've implemented these industry standard APIs that come from the, largely, the open source ecosystem, that you can propagate those to areas on the network that the vendors, who are behind those APIs can't, necessarily, do. Into the Telcos, towards the Edge. And, I assume, part of that is cause of the density and the simplicity. So, essentially, your footprint's smaller in terms of hardware and the operational simplicity is greater. Is that a fair assessment? >> Yes and also, we support a lot of Amazon compatible APIs which are RESTful, typically, HTTP based. Very convenient to work with in a cloud environment. Another thing is, because we're taking all the state on ourself, the different forms of states whether it's a message queue or a table or an object, et cetera, that makes the computation layer very simple. So one of the things that we are, also, demonstrating is the integration we have with Kubernetes that, basically, now simplifies Kubernetes. Cause you don't have to build all those different data services for cloud native infrastructure. You just run Kubernetes. We're the volume driver, we're the database, we're the message queues, we're everything underneath Kubernetes and then, you just run Spark or TensorFlow or a serverless function as a Kubernetes micro service. That allows you now, elastically, to increase the number of Spark jobs that you need or, maybe, you have another tenant. You just spun a Spark job. YARN has some of those attributes but YARN is very limited, very confined to the Hadoop Ecosystem. TensorFlow is not a Hadoop player and a bunch of those new tools are not in Hadoop players and everyone is now adopting a new way of doing streaming and they just call it serverless. serverless and streaming are very similar technologies. The advantage of serverless is all this pre-packaging and all this automation of the CICD. The continuous integration, the continuous development. So we're thinking, in order to simplify the developer in an operation aspects, we're trying to integrate more and more with cloud native approach around CICD and integration with Kubernetes and cloud native technologies. 
>> Would it be fair to say that from a developer or admin point of view, you're pushing out from the cloud towards the Edge faster than the existing implementations, say, the Apache ecosystem or the AWS ecosystem, where AWS has something on the edge. I forgot whether it's Snowball or Greengrass or whatever. Where they at least get the Lambda function. >> They're in the field, by the way, and it's interesting to see. One of the things is they allowed Lambda functions in their CDN, which is going in the direction I mentioned, just for minimal functionality. Another thing is they have those boxes where they have a single VM and they can run Lambda functions as well. But I think their ability to run computation is very limited and also, their focus is on shipping the boxes through mail and we want it to be always connected. >> Our final question for you, just to get your thoughts. Great setup, by the way. This is very informative. Maybe we should do a follow-up on Skype in our studio for the Silicon Valley Friday show. Google Next was interesting. They're serious about the Enterprise but you can see that they're not yet there. What is the Enterprise readiness from your perspective? Cause Google has the tech and they try to flaunt the tech. We're great, we're Google, look at us, therefore, you should buy us. It's not that easy in the Enterprise. How would you size up the different players? Because they're all not like Amazon, although Amazon is winning. You got Amazon, Azure and Google. Your thoughts on the cloud players. >> The way we attack Enterprise, we don't attack it from an Enterprise perspective or IT perspective, we take it from a business use case perspective. Especially because we're small and we have to run fast. You need to identify a real critical business problem. We're working with stock exchanges and they have a lot of issues around monitoring the daily trade activities in real time. If you compare what we do with them on this continuous analytics notion to how they work with Excel and Hadoop, it's totally different, and now they could do things which are way different. I think that one of the things is that, to hook customers, if Google wants to succeed against Amazon, they have to find a way to approach those business owners and say here's a problem Mr. Customer, here's a business challenge, here's what I'm going to solve. If they're just going to say, you know what? My VMs are cheaper than Amazon's, it's not going to be a-- >> Also, they're doing the whole, they're calling it lift and shift, which is code word for rip and replace in the Enterprise. So that's, essentially, I guess, a good opportunity if you can get people to do that, but not everyone's ripping and replacing and lifting and shifting. >> But Google has a lot of advantages around areas of AI and things like that. So they should try and leverage that. If you think about Amazon's approach to AI, it's fund the university to build a project and then say it's ours, where Google created TensorFlow and created a lot of other IPs and Dataflow and all those solutions and contributed them to the community. I really love Google's approach of contributing Kubernetes, of contributing TensorFlow. And this way, they're planting the seeds so the new generation is going to work with Kubernetes and TensorFlow, and they're going to say, "You know what? Why would I mess with this thing on (mumbles), just go and--" >> Regular cloud, do multi-cloud. >> Right to the cloud. But I think a lot of criticism about Google is that they're too research oriented.
They don't know how to monetize and approach the-- >> Enterprise is just a whole different drum beat, and I think that's the only thing in my complaint with them: they've got to get that knowledge and/or buy companies. Have a quick final point on Spanner, or any analysis of Spanner, that went from paper, pretty quickly, from paper to product. >> So before we started iguazio, I studied Spanner quite a bit. All the publications were there, and all the other things like Spanner. Spanner has the underlying layer called Colossus. And our data layer is very similar to how Colossus works. So we're very familiar. We took a lot of concepts from Spanner on our platform. >> And you like Spanner, it's legit? >> Yes, again. >> Cause you copied it. (laughs) >> Yaron: We haven't copied-- >> You borrowed some best practices. >> I think I cited about 300 research papers before we did the architecture. But we, basically, took the best of each one of them. Cause there's still a lot of issues. Most of those technologies, by the way, are designed for mechanical disks and we can talk about it in a different-- >> And you have Flash. Alright, Yaron, we have gone over here. Great segment. We're here, live in Silicon Valley, breakin' it down, getting under the hood. Looking at 10X, 100X performance advantages. Keep an eye on iguazio, they're looking like they got some great products. Check them out. This is the CUBE. I'm John Furrier with George Gilbert. We'll be back with more after this short break. (upbeat synthesizer music)
SUMMARY :
it's the CUBE, covering Big Welcome to the CUBE. to bring you into the Yaron: I like the about some of the amazing and it's more like a cloud service. And in the same time, So that's the key innovation, So that's the secret sauce, And the performance, you said about 100x. and fit it into the purview of the marketplace. and all the cloud service that's the new fashion. You've got the Edge. Yeah, but all the new databases, That's the horizontally-scalable and not the lower layers of the data. So how is the Edge digest or summarize the data. going to disrupt the CDN. One is on the lower layers of, we're going to have you guys on every pop. the local problems. So, I'm going to have a video with, maybe, of moving more stuff into the Edge and the move of Telcos buying white boxes. in the different Telco locations. John: Oh you're talking This is an opportunity that we and the operational simplicity is greater. is the integration we have with Kubernetes the Apache Ecosystem or the AWS Ecosystem One of the things they It's not that easy in the Enterprise. to say, you know what? and replace in the Enterprise. and consumered it to the community. Right to the cloud. that's the only thing and all the other things like Spanner. Cause you copied it. and we can talk about it in a different-- This is the CUBE.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
George Gilbert | PERSON | 0.99+ |
George | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Telcos | ORGANIZATION | 0.99+ |
Yaron Haviv | PERSON | 0.99+ |
ORGANIZATION | 0.99+ | |
Equinox | ORGANIZATION | 0.99+ |
John | PERSON | 0.99+ |
Mellanox | ORGANIZATION | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
Telco | ORGANIZATION | 0.99+ |
Kevin | PERSON | 0.99+ |
Dave Alante | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Yaron | PERSON | 0.99+ |
Silicon Valley | LOCATION | 0.99+ |
Pierre Levin | PERSON | 0.99+ |
100 gig | QUANTITY | 0.99+ |
AngularJS | TITLE | 0.99+ |
San Jose, California | LOCATION | 0.99+ |
30% | QUANTITY | 0.99+ |
John Furrier | PERSON | 0.99+ |
One | QUANTITY | 0.99+ |
two examples | QUANTITY | 0.99+ |
First | QUANTITY | 0.99+ |
third one | QUANTITY | 0.99+ |
Skype | ORGANIZATION | 0.99+ |
one day | QUANTITY | 0.99+ |
Netflix | ORGANIZATION | 0.99+ |
10 gigabyte | QUANTITY | 0.99+ |
Teradata | ORGANIZATION | 0.99+ |
two elements | QUANTITY | 0.99+ |
CUBE | ORGANIZATION | 0.99+ |
Spanner | TITLE | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
S3 | TITLE | 0.99+ |
first | QUANTITY | 0.99+ |
1999 | DATE | 0.98+ |
two layers | QUANTITY | 0.98+ |
Excel | TITLE | 0.98+ |
both sides | QUANTITY | 0.98+ |
Spark | TITLE | 0.98+ |
Five gig | QUANTITY | 0.98+ |
Kubernetes | TITLE | 0.98+ |
Paxos | ORGANIZATION | 0.98+ |
Intel | ORGANIZATION | 0.98+ |
100X | QUANTITY | 0.98+ |
Azure | ORGANIZATION | 0.98+ |
Colossus | TITLE | 0.98+ |
about 7% | QUANTITY | 0.98+ |
Yahoo | ORGANIZATION | 0.98+ |
Hadoop | TITLE | 0.97+ |
Boston Bomber | ORGANIZATION | 0.97+ |
Joel Cumming, Kik - Spark Summit East 2017 - #SparkSummit - #theCUBE
>> Narrator: Live from Boston, Massachusetts this is the Cube, covering Spark Summit East 2017 brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. >> Welcome back to Boston, everybody, where it's a blizzard outside and a blizzard of content coming to you from Spark Summit East, #SparkSummit. This is the Cube, the worldwide leader in live tech coverage. Joel Cumming is here. He's the head of data at Kik. Kicking butt at Kik. Welcome to the Cube. >> Thank you, thanks for having me. >> So tell us about Kik, this cool mobile chat app. Checked it out a little bit. >> Yeah, so Kik has been around since about 2010. We're, as you mentioned, a mobile chat app, start-up based in Waterloo, Ontario. Kik really took off, really 2010 when it got 2 million users in the first 22 days of its existence. So was insanely popular, specifically with U.S. youth, and the reason for that really is Kik started off in a time where chatting through text cost money. Text messages cost money back in 2010, and really not every kid has a phone like they do today. So if you had an iPod or an iPad all you needed to do was sign up, and you had a user name and now you could text with your friends, so kids could do that just like their parents could with Kik, and that's really where we got our entrenchment with U.S. youth. >> And you're the head of data. So talk a little bit about your background. What does that mean to be a head of data? >> Yes, so prior to working at Kik I worked at Blackberry, and I like to say I worked at Blackberry probably around the time just before you bought your first Blackberry and I left just after you bought your first iPhone. So kind of in that range, but was there for nine years. >> Vellante: Can you do that with real estate? >> Yeah, I'd love to be able to do that with real estate. But it was a great time at Blackberry. It was very exciting to be part of that growth. When I was there, we grew from three million to 80 million customers, from three thousand employees to 17 thousand employees, and of course, things went sideways for Blackberry, but conveniently at the end Blackberry was working in BBM, and leading a team of data scientists and data engineers there. And BBM if you're not familiar with it is a chat app as well, and across town is where Kik is headquartered. The appeal to me of moving to Kik was a company that was very small and fast moving, but they actually weren't leveraging data at all. So when I got there, they had a pile of logs sitting in S3, waiting for someone to take advantage of them. They were good at measuring events, and looking at those events and how they tracked over time, but not really combining them to understand or personalize any experience for their end customers. >> So they knew enough to keep the data. >> They knew enough to keep the data. >> They just weren't sure what to do with it. Okay so, you come in, and where did you start? >> So the first day that I started that was the first day I used any AWS product, so I had worked on the big data tools at the old place, with Hadoop and Pig and Hive and Oracle and those kinds of things, but had never used an AWS product until I got there and it was very much sink or swim and on my first day our CEO in the meeting said, "Okay, you're data guy here now. "I want you to tell me in a week why people leave Kik." And I'm like, man we don't even have a database yet. The first thing I did was I fired up a Redshift cluster. 
First time I had done that, looked at the tools that were available in AWS to transform the data using EMR and Pig and those kinds of things, and was lucky enough, fortunate enough that I could figure that out in a week and I didn't give him the full answer of why people left, but I was able to give him some ideas of places we could go based on some preliminary exploration. So I went from leading this team of about 40 people to being a team of one and writing all the code myself. Super exciting, not the experience that everybody wants, but for me it was a lot of fun. Over the last three years I've built up the team. Now we have three data engineers and three data scientists and indeed it's a lot more important to people every day at Kik. >> What sort of impact has your team had on the product itself and the customer experience? >> So in the beginning it was really just trying to understand the behaviors of people across Kik, and that took a while to really wrap our heads around, and any good data analysis combines behaviors that you have to ask people their opinion on and also behaviors that we see them do. So I had an old boss that used to work at Rogers, which is a telecom provider in Canada, and he said if you ask people the things that they watch they tell you documentaries and the news and very important stuff, but if you see what they actually watch it's reality TV and trashy shows, and so the truth is really somewhere in the middle. There's an aspirational element. So for us really understanding the data we already had, instrumenting new events, and then in the last year and a half, building out an A/B testing framework is something that's been instrumental in how we leverage data at Kik. So we were making decisions by gut feel in the very beginning, then we moved into this era where we were doing A/B testing and very focused on statistical significance, and rigor around all of our experiments, but then stepping back and realizing maybe the bets that we have aren't big enough. So we need to maybe bet a little bit more on some bigger features that have the opportunity to move the needle. So we've been doing that recently with a few features that we've released, but data is super important now, both to stimulate creativity of our product managers as well as to measure the success of those features. >> And how do you map to the product managers who are defining the new features? Are you a central group? Are you sort of point guards within the different product groups? How does that work, with your evidence-based decisions or recommendations, but they make, ultimately, presumably, the decisions. What's the dynamic? >> So it's a great question. In my experience, it's very difficult to build a structure that's perfect. So in the purely centralized model you've got this problem of people are coming to you to ask for something, and they may get turned away because you're too busy, and then in the decentralized model you tend to have lots of duplication and overlap and maybe not sharing all the things that you need to share. So we tried to build a hybrid of both. And so we had our data engineers centralized and we tried doing what we called tours of duty, so our data scientists would be embedded with various teams within the company so it could be, it could be the core messenger team. It could be our bot platform team. It could be our anti-spam team.
And they would sit with them and it's very easy for product managers and developers to ask them questions and for them to give out answers, and then we would rotate those folks through a different tour of duty after a few months and they would sit with another team. So we did that for a while, and it worked pretty well, but one of the major things we found was a problem: there's no good checkpoint to confirm that what they're doing is right. So in software development you're releasing a version of software. There's QA, there's code review and there's structure in place to ensure that yes, this number I'm providing is right. It's difficult when you've got a data scientist who's out with a team for him to come back to the team and get that peer review. So now we're kind of reevaluating that. We use an agile approach, but we have primes for each of these groups but now we all sit together. >> So the accountability is after the data scientist made a recommendation that the product manager agrees with, how do you ensure that it measured up to the expectation? Like sort of after the fact. >> Yeah, so in those cases with our A/B tests, it's nice to have that unbiased data resource on the team that's embedded with them that can step back and say yes, this idea worked, or it didn't work. So that's the approach that we're taking. It's not a dedicated resource, but a prime resource for each of these teams that's a subject matter expert and then is evaluating the results in an unbiased kind of way. >> So you've got this relatively small, even though it's quadruple the size when you started, data team and then application development team as sort of colleagues or how do you interact with them? >> Yeah, we're actually part of the engineering organization at Kik, part of R and D, and in different times in my life I've been part of different organizations whether it's marketing or whether it's I.T. or whether it's R and D, and R and D really fits nicely. And the reason why I think it's the best is because if there's data that you need to understand users more there's much more direct control over getting that element instrumented within a product that you have when you're part of R and D. If you're in marketing, you're like hey, I'd love to know how many times people tap on that red button, but no event fires when that red button is tapped. Good luck trying to get the software developers to put that in. But when there's an inherent component of R and D that's dependent on data, and data has that direct path to those developers, getting that kind of thing done is much easier. >> So from a tooling standpoint, thinking about data scientists and data engineers, a lot of the tools that we've seen in this so-called big data world have been quite bespoke. Different interfaces, different experience. How are you addressing that? Does Spark help with that? Maybe talk about that a bit more. >> Yeah, so I was fortunate enough to do a session today that sort of talked about data V1 at Kik versus data V2 at Kik, and we drew this kind of a line in the sand. So when I started it was just me. I'm trying to answer these questions very quickly on these three or five day timelines that we get from our CEO. >> Vellante: You've been here a week, come on! >> Yeah exactly, so you sacrifice data engineering and architecture when you're living like that. So you can answer questions very quickly. It worked well for a while, but then all of a sudden we come up and we have 300 data pipelines. They're a mess.
They're hard to manage and control. We've got code sometimes in SQL or sometimes in Python scripts, or sometimes on people's laptops. We have no real plan for GitHub integration. And then, you know, real scalability issues out of Redshift. We were doing a lot of our workloads in Redshift to do transformations just because, get the data into Redshift, write some SQL and then have your results. We're running into contention problems with that. So what we decided to do is sort of stop, step back and say, okay so how are we going to house all of this atomic data that we have in a way that's efficient. So we started with Redshift, our database was 10 terabytes. Now it's 100, except we get five terabytes of data per day that's new coming in, so putting that all in Redshift, it doesn't make sense. It's not all that useful. So if we cull that data under supervision, we don't want to get rid of the atomic data, how do we control that data under supervision. So we decided to go the data lake route, even though we hate the term data lake, but basically a folder structure within S3 that's stored in a query optimized format like Parquet, and now we can access that data very quickly at an atomic level, at a cleansed level and also at an aggregate level. So for us, this data V2 was the evolution of stopping doing a lot of things the way we used to do, which was lots of data pipelines, kind of code that was all over the place, and then aggregations in Redshift, and starting to use Spark, specifically Databricks. Databricks we think of in two ways. One is kind of managed Spark, so that we don't have to do all the configuration that we used to have to do with EMR, and then the second is notebooks that we can align with all the work that we're doing and have revision control and GitHub integration as well. >> A question to clarify, when you've put the data lake, which is the file system and then the data in Parquet format, or Parquet files, so this is where you want to have some sort of interactive experience for business intelligence. Do you need some sort of MPP server on top of that to provide interactive performance, or, because I know a lot of customers are struggling at that point where they got all the data there, and it's kind of organized, but then if they really want to munge through that huge volume they find it slows to slower than a crawl. >> Yeah, it's a great point. And we're at the stage right now where our data lake at the top layer of our data lake where we aggregate and normalize, we also push that data into Redshift. So Redshift what we're trying to do with that is make it a read-only environment, so that our analysts and developers, so they know they have consistent read performance on Redshift, where before when it's a mix of batch jobs as well as read workload, they didn't have that guarantee. So you're right, and we think what will probably happen over the next year or so is the advancements in Spark will make it much more capable as a data warehousing product, and then you'd have to start to question do I need both Redshift and Spark for that kind of thing? But today I think some of the cost-based optimizations that are coming, at least the promise of them coming, I would hope that those would help Spark become more of a data warehouse, but we'll have to see. >> So carry that thread a little further through. I mean in terms of things that you'd like to see in the Spark roadmap, things that could be improved. What's your feedback to Databricks?
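[For context, a minimal PySpark sketch of the atomic / cleansed / aggregate data lake layout Joel describes above, before the conversation turns to the Spark roadmap. The bucket names, paths, and columns are hypothetical, and the Redshift load is only indicated in a comment; treat it as an illustration of the pattern, not Kik's actual pipeline.]

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-v2-sketch").getOrCreate()

# Atomic events live in a folder structure on S3, stored as Parquet.
events = spark.read.parquet("s3://example-datalake/atomic/events/")

# Cleansed layer: keep well-formed rows and partition by event date.
cleansed = (events
            .filter(F.col("user_id").isNotNull())
            .withColumn("event_date", F.to_date("event_ts")))
(cleansed.write.mode("overwrite")
         .partitionBy("event_date")
         .parquet("s3://example-datalake/cleansed/events/"))

# Aggregate layer: small, normalized rollups such as daily active users.
dau = (cleansed
       .groupBy("event_date")
       .agg(F.countDistinct("user_id").alias("daily_active_users")))
dau.write.mode("overwrite").parquet("s3://example-datalake/aggregate/dau/")

# Only this top, aggregated layer would then be copied into Redshift
# (for example via UNLOAD/COPY or a JDBC writer, not shown), keeping
# Redshift a read-only, consistent-performance store for analysts.
```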
>> We're fortunate, we work with them pretty closely. We've been a customer for about half a year, and they've been outstanding working with us. So structured streaming is a great example of something we worked pretty closely with them on. We're really excited about it. We don't have, you know we have certain pockets within our company that require very real-time data, so obviously your operational components. Are your servers up or down, as well as our anti-spam team. They require very low latency access to data. We haven't typically, if we batch every hour that's fine in most cases, but structured streaming, when our data streams are coming in now through Kinesis Firehose, and we can process those without having to worry about checking to see if it's time we should start this or is all the data there so we can run this batch. Structured streaming solves a lot of those, it simplifies a lot of that workload for us. So that's something we've been working with them on. The other things that we're really interested in. We've got a bit of a list, but the other major ones are how do you start to leverage this data to use it for personalization back in the app? So today we think of data in two ways at Kik. It's data as KPIs, so it's like the things you need to run your business, maybe it's A/B testing results, maybe it's how many active users you had yesterday, that kind of thing. And then the second is data as a product, and how do you provide personalization at an individual level based on your data science models back out to the app. So we do that, I should point out at Kik we don't see anybody's messages. We don't read your messages. We don't have access to those. But we have the metadata around the transactions that you have, like most companies do. So that helps us improve our products and services under our privacy policy to say okay, who's building good relationships and who's leaving the platform and why are they doing it. But we can also surface components that are useful for personalization, so if you've chatted with three different bots on our platform that's important for us to know if we want to recommend another bot to you. Or you know the classic "people you may know" recommendations. We don't do that right now, but behind the scenes we have the kind of information that we could help personalize that experience for you. So those two things are very different. In a lot of companies there's an R and D element, like at Blackberry, the App World recommendation engine was something that there was a team that ran in production but our team was helping those guys tweak and tune their models. So it's the same kind of thing at Kik where we can build, our data scientists are building models for personalization, and then we need to surface them back up to the rest of the company. And the process right now of taking the results of our models and then putting them into a real time serving system isn't that clean, and so we do batches every day on things that don't need to be near real-time, so things like predicted gender. If we know your first name, we've downloaded the list of baby names from the U.S. Social Security website and we can say the frequency of the name Pat 80 percent of the time it's a male, and 20 percent it's a female, but Joel is 99 percent of the time it's male and one percent of the time it's a female, so based on your tolerance for whatever you want to use this personalization for we can give you our degrees of confidence on that.
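[A small Python sketch of the name-frequency approach Joel has just outlined. It assumes a hypothetical names.csv shaped like the public Social Security baby-names data (name, sex, count); the threshold argument plays the role of the "tolerance" he mentions, and Kik's real scoring may of course differ.]

```python
import csv
from collections import defaultdict

# Tally male/female counts per name from a combined CSV of rows like: name,sex,count
counts = defaultdict(lambda: {"M": 0, "F": 0})
with open("names.csv", newline="") as f:
    for name, sex, count in csv.reader(f):
        counts[name.lower()][sex] += int(count)

def predicted_gender(name, threshold=0.75):
    """Return (label, confidence); label is None when confidence is below threshold."""
    c = counts.get(name.lower())
    total = (c["M"] + c["F"]) if c else 0
    if total == 0:
        return None, 0.0
    p_male = c["M"] / total
    confidence = max(p_male, 1.0 - p_male)
    label = "male" if p_male >= 0.5 else "female"
    return (label if confidence >= threshold else None), confidence

# With the frequencies quoted in the interview:
#   predicted_gender("Joel")                -> ("male", ~0.99)
#   predicted_gender("Pat")                 -> ("male", ~0.80)
#   predicted_gender("Pat", threshold=0.9)  -> (None,  ~0.80)  # stricter tolerance
```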
That's one example of what we surface right now in our API back to our own first party components of our app. But in the future with more real-time data coming in from Spark streaming with more real-time model scoring, and then the ability to push that over into some sort of capability that can be surfaced up through an API, it gives our data team the capability of being much more flexible and fast at surfacing things that can provide personalization to the end user, as opposed to what we have now which is all this batch processing and then loading once a day and then knowing that we can't react on the fly. >> So if I were to try and turn that into a sort of a roadmap, a Spark roadmap, it sounds like the process of taking the analysis and doing perhaps even online training to update the models, or just rescoring if you're doing a little slightly less fresh, but then serving it up from a high speed serving layer, that's when you can take data that's coming in from the game and send it back to improve the game in real time. >> Exactly. Yep. >> That's what you're looking for. >> Yeah. >> You and a lot of other people. >> Yeah I think so. >> So how's the event been for you? >> It's been great. There's some really smart people here. It's humbling when you go to some of these sessions and you know, we're fortunate where we try and not have to think about a lot of the details that people are explaining here, but it's really good to understand them and know that there are some smart people that are fixing these problems. As like all events, been some really good sessions, but the networking is amazing, so meeting lots of great people here, and hearing their stories too. >> And you're hoping to go to the hockey game tonight. >> Yeah, I'd love to go to the hockey game. See if we can get through the snow. >> Who are the Bruins playing tonight? >> San Jose. >> Oh, good. >> It could be a good game. >> Yeah, the rivalry. You guys into the hockey game? Alright, good. Alright, Joel, listen, thanks very much for coming on the Cube. Great segment. I really appreciate your insights and sharing. >> Okay, thanks for having me. >> You're welcome. Alright, keep it right there, everybody. George and I will be back right after this short break. This is the Cube. We're live from Spark Summit in Boston.
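[One more hedged sketch before the summary: the hourly-batch-versus-streaming point Joel raises. Kinesis Firehose can deliver event files to S3, and Spark Structured Streaming's file source then picks up new files as they land, so nobody has to check whether the hour's data has fully arrived. Paths, schema, and window sizes below are hypothetical.]

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Structured Streaming treats newly arriving Firehose files as an unbounded table.
events = (spark.readStream
          .schema(schema)
          .json("s3://example-firehose-delivery/events/"))

# Rolling per-minute counts by event type; the watermark lets Spark finalize
# windows on its own instead of someone scheduling an hourly batch.
counts = (events
          .withWatermark("event_ts", "10 minutes")
          .groupBy(F.window("event_ts", "1 minute"), "event_type")
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3://example-datalake/streaming/event_counts/")
         .option("checkpointLocation", "s3://example-datalake/checkpoints/event_counts/")
         .start())
```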
SUMMARY :
brought to you by Databricks. and a blizzard of content coming to you So tell us about Kik, this cool mobile chat app. and the reason for that really is Kik started off What does that mean to be a head of data? and I like to say I worked at Blackberry but conveniently at the end Blackberry was working Okay so, you come in, and where did you start? and on my first day our CEO in the meeting said, and also behaviors that we see them do. And how do you map to the product managers but one of the major things we found was a problem So the accountability is after the data scientist So that's the approach that we're taking. and data has that direct path to those developers, a lot of the tools that we've seen and we drew this kind of a line in the sand. One is kind of managed Spark, so that we don't have to do and it's kind of organized, but then if they that are coming, at least the promise of them coming in the Spark roadmap, things that could be improved. It's data as KPIs, so it's like the things you need from the game and send it back to improve the game and not have to think about a lot of the details See if we can get through the snow. Yeah, the rivalry. This is the Cube.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
George | PERSON | 0.99+ |
George Gilbert | PERSON | 0.99+ |
Canada | LOCATION | 0.99+ |
Joel Cumming | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Blackberry | ORGANIZATION | 0.99+ |
2010 | DATE | 0.99+ |
Joel | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
10 terabytes | QUANTITY | 0.99+ |
20 percent | QUANTITY | 0.99+ |
nine years | QUANTITY | 0.99+ |
99 percent | QUANTITY | 0.99+ |
Boston | LOCATION | 0.99+ |
iPad | COMMERCIAL_ITEM | 0.99+ |
three million | QUANTITY | 0.99+ |
17 thousand employees | QUANTITY | 0.99+ |
Boston, Massachusetts | LOCATION | 0.99+ |
three thousand employees | QUANTITY | 0.99+ |
Kik | ORGANIZATION | 0.99+ |
three | QUANTITY | 0.99+ |
Waterloo, Ontario | LOCATION | 0.99+ |
iPod | COMMERCIAL_ITEM | 0.99+ |
three data scientists | QUANTITY | 0.99+ |
two things | QUANTITY | 0.99+ |
Python | TITLE | 0.99+ |
100 | QUANTITY | 0.99+ |
one percent | QUANTITY | 0.99+ |
first | QUANTITY | 0.99+ |
Redshift | TITLE | 0.99+ |
both | QUANTITY | 0.99+ |
2 million users | QUANTITY | 0.99+ |
80 percent | QUANTITY | 0.99+ |
iPhone | COMMERCIAL_ITEM | 0.99+ |
today | DATE | 0.99+ |
Kik | PERSON | 0.99+ |
five day | QUANTITY | 0.99+ |
each | QUANTITY | 0.99+ |
three data engineers | QUANTITY | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
second | QUANTITY | 0.99+ |
300 data pipelines | QUANTITY | 0.98+ |
One | QUANTITY | 0.98+ |
yesterday | DATE | 0.98+ |
two ways | QUANTITY | 0.98+ |
Databricks | ORGANIZATION | 0.98+ |
S3 | TITLE | 0.98+ |
one | QUANTITY | 0.98+ |
Parquet | TITLE | 0.98+ |
first day | QUANTITY | 0.98+ |
Rogers | ORGANIZATION | 0.98+ |
about half a year | QUANTITY | 0.97+ |
once a day | QUANTITY | 0.97+ |
Spark | TITLE | 0.97+ |
Spark Summit East 2017 | EVENT | 0.97+ |
first 22 days | QUANTITY | 0.97+ |
about 40 people | QUANTITY | 0.97+ |
next year | DATE | 0.97+ |
first thing | QUANTITY | 0.96+ |
First time | QUANTITY | 0.96+ |
Spark | ORGANIZATION | 0.95+ |
U.S. Social Security | ORGANIZATION | 0.95+ |
a week | QUANTITY | 0.95+ |
80 million customers | QUANTITY | 0.95+ |
Joey Echeverria, Rocana - On the Ground - #theCUBE
>> Announcer: theCUBE presents On The Ground. (light techno music) >> Hello, everyone. Welcome to a special, exclusive On the Ground CUBE coverage at Oracle Headquarters. I'm John Furrier, the cohost of theCUBE, and we're here with Joey Echeverria, Platform Technical Lead at Rocana here, talking about big data, cloud. Welcome to this On The Ground. >> Thanks for having me. >> So you guys are a digital native company. What's it like to be a digital native company these days, and what does that mean? >> Yeah, basically if you look across the industry, regardless of if you're in retail or manufacturing, your biggest competitors are the companies that have native digital advantages. What we mean by that is these are companies that you think of as tech companies, right? Amazon's competitive advantage in the retail space is that their entire business is instrumented, everything they do is collected. They collect logs and metrics for everything. They don't view IT as a separate organization, they view it as core to their business. And really what we do at Rocana is build tools to help companies that aren't digital native compete in that landscape, get a leg up, get the same kind of operational insight into their data and their customers, that they don't otherwise have. >> So that's an interesting comment about how IT is fundamental in their business model. In the traditional enterprise, the non-digital if you will, IT's a department. >> Joey: Exactly. >> So big data brings a connection to IT that gives them essentially a new lift, if you will, a new persona inside the company. Talk about that dynamic. >> Yeah, big data really gives you the technical foundation to build the tools and apps on top of those platforms that can compete with these digitally native companies. No longer do you need to go out and hire PhDs from Stanford or Berkeley. You can work with the same technology that they've built, that the open source community has built, and build on top of that, leverage the scalability, leverage the flexibility, and bring all of your data together so that you can start to answer the questions that you need to in order to drive the business forward. >> So do you think IT is more important with big data and some of the cloud technologies or less important? >> I think it starts to dissolve as a stand-alone department but it becomes ingrained in everything that a company does. Your IT department shouldn't just be fixing fax machines or printers, they should really be driving the way that you do your business and think about your business, what data you collect, how you interact with customers. Capturing all of those signals and turning that signal into noise-- Or sorry, filtering out the noise, turning the signal into action so that you can reach your customers and drive the business going forward. >> So IT becomes part of the fabric of the business model, so it's IT everywhere? >> Joey: Exactly, exactly. >> So what are you seeing out there that's disruptive right now, from your standpoint? You guys have a lot of customers that are on the front end of this big wave of data, cloud, and emerging technology. We're seeing certainly great innovations, machine learning, AI, cognitive, Ya know, soon Ford's going to have cars in five years, Uber's going to have self-driving cars in Pittsburgh by this year. I mean, this is a pretty interesting time. What are some of the cool things that you see happening around this dynamic of big-data-meets-IT? 
>> Yeah, I think one of the biggest things that we see in general is that folks want turnkey solutions. They don't want to have to think about all of the plumbing, they don't want to go out and buy a bunch of servers, rack them themselves, and figure out what's the right bill of materials. They want turnkey, whether that's cloud or physical appliances. And so that's one of the reasons why we work so well with Oracle on their Big Data Appliance. We can turn our application, which helps customers transform their business into being digital native, into a turnkey solution. They don't have to deal with all of the plumbing. They just know that they get a reliable platform that scales the way that they need to, and they're able to deploy these technologies much more rapidly. And we do the same thing with our cloud partners. >> So I got to the tough question. You guys are a start-up, certainly growing really fast, you got a lot of great technical people, but why not just do it yourself? Why partner with Oracle? >> Oh, that's a great question. I mean, Oracle has great reach in the marketplace, they're trusted. We don't want to solve every problem. We really want to partner with other companies, leverage their strengths, they can leverage our strengths and at the end of the day, what we end up building together is a much stronger solution than we could build ourselves. One of the main reasons why we in particular are not, say, a SaaS company where we're just hosting everything in the cloud, is we need to go to where the data is and for a lot of these non-digital native companies, that data is still on-prem in their data centers. That being said, we're ready for the transition to the cloud. We have customers running our software in the cloud. We run everything in the cloud internally because, obviously as a small start-up, we don't want to go out and spend a lot of money on physical hardware. So we're really ready for both of those. >> Is this a big trend that you're seeing? 'Cause this is consistent with, some people say, the API economy. People can actually sling APIs together, build connectors, build a core product, but using API as a comprehensive solution is a mix between core and then outsourced, or partnering. Is that a trend that's beyond Rocana? >> Oh, definitely. One of the reasons why we build on top of open source software and open source standards is for that network effect. One of our core tenets is that we don't own the data. You own the data. So we store everything in file formats like Apache Parquet because it has the widest reach, the widest variety of tools that can access it. If there's a use case that you want to perform on your data that our application doesn't solve for you, fire up your Tableau, point it at the exact same data sets and go to town. The data is there for the customer, it's not there for us. >> What's the coolest thing that you're seeing right now in the marketplace, relative to disruption? You've got upcoming start-ups like you guys, Rocana, you got the big companies like Oracle, which are innovating now with opening up and not just being the proprietary database, using an open source. So what are some of the big things you're seeing right now between the dynamics of the big guys and the up-starts? >> Yeah, I think right now the biggest thing is turning data into the central cornerstone of everything that you're doing.
No longer can you say, "I'm going to launch this project," without explaining what data are you going to collect, what are the metrics going to look like, how do we know if it's working, how do we know if it's not working. That sort of infusion of data everywhere, and even as you look across broader industry trends, things like IoT. IoT is really just the recognition that every device, every thing needs to have a connection to the network and a connection to the Internet and generate data. And then it's what you do with that data and tools that allow you to make sense of that data that are really going to drive you forward. >> IoT is a great example of your point about IT becoming the fabric because most IoT sensor stuff is not even connected to databases or IT. So now you're seeing this whole renaissance of IT getting into the edge of the network with all this IoT data. I mean, they have to be more diverse in their dealing with the data. >> Exactly, and that's why you need more native analytics. So one of the core parts of our platform is anomaly detection. Across all of your different devices in your data center, you're generating tons of data, tons of data. That data needs to be put into context. What may be a major shift that's a problem in one data set isn't a problem in another. And so you have to have that historical context. That's one of the reasons why we also build on these big data platforms, is for things like security use cases. It takes, on average, nine months for you to actually detect that you've been breached. If you don't have the logs from nine months ago, you're not going to be able to find out how they got in, when they got in, so you really need that historical context to put the new data into the proper context and to be able to have the automated analytics that drive you and your analysis forward, rather than forcing you to sort of dumpster dive with just search and guess what's working. >> Dumpster diving into the data swamp, new buzzwords. Yeah, but this is really the big thing. The focus on real time seems to be the hot button, but you need data from a while back to mix in with the real time to get the right insight. Is that kind of the big trend? >> Oh yeah, absolutely. Whenever you talk about machine learning, you want the real time insights from it, but it's only as powerful as the historical data that you have to build those models. And so a big thing that we focus on is how to make it easy to build those models, how to do it automatically, how to get away from having 500 different tunables that the customer has to set, and really put it on autopilot. >> Well, making it easy, but also fast. It's got to get in at low latency, that's another one. >> Oh absolutely. I mean, we leverage Kafka for just that reason. We're able to bring in millions of events per second into moderate size environments without breaking a sweat. >> Rocana, great stuff. Joey, great to chat with you again, here On The Ground at the Oracle Headquarters. I'm John Furrier, you're watching a special CUBE On The Ground here at Oracle Headquarters. Thanks for watching. (light techno music)
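[Rocana's actual anomaly-detection algorithms aren't detailed in this interview, so the sketch below is only a generic illustration of the point Joey makes: an anomaly score needs a per-source historical baseline rather than one global threshold. The window size, threshold, and data layout are assumptions.]

```python
import pandas as pd

def flag_anomalies(counts: pd.Series, window: int = 24 * 7, k: float = 3.0) -> pd.DataFrame:
    """Flag points sitting more than k standard deviations from a trailing baseline.

    `counts` is an hourly event-count series for a single host or metric; the
    trailing window (a week of hours by default) is the historical context, so
    a jump that is normal for one source can still be an anomaly for another.
    """
    baseline = counts.rolling(window, min_periods=window // 2).mean().shift(1)
    spread = counts.rolling(window, min_periods=window // 2).std().shift(1)
    score = (counts - baseline) / spread
    return pd.DataFrame({"count": counts, "baseline": baseline,
                         "score": score, "anomaly": score.abs() > k})

# Hypothetical usage on per-host hourly log volumes:
# hourly = logs.groupby([pd.Grouper(key="ts", freq="H"), "host"])["event"].count()
# web01_flags = flag_anomalies(hourly.xs("web-01", level="host"))
```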
SUMMARY :
(light techno music) I'm John Furrier, the cohost of theCUBE, So you guys are a digital native company. that you think of as tech companies, right? In the traditional enterprise, the non-digital if you will, that gives them essentially a new lift, if you will, to answer the questions that you need to into action so that you can reach your customers You guys have a lot of customers that are on the front end that scales the way that they need to, So I got to the tough question. and at the end of the day, what we end up building together the API economy. One of the reasons why we build on top in the marketplace, relative to disruption? that are really going to drive you forward. getting into the edge of the network that drive you and your analysis forward, Is that kind of the big trend? that the customer has to set, It's got to get in low latency, that's another one. We're able to bring in millions of events per second Joey, great to chat with you again,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Joey Echeverria | PERSON | 0.99+ |
Uber | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
Ford | ORGANIZATION | 0.99+ |
John Furrier | PERSON | 0.99+ |
Joey | PERSON | 0.99+ |
Pittsburgh | LOCATION | 0.99+ |
One | QUANTITY | 0.99+ |
500 different tuna bowls | QUANTITY | 0.99+ |
nine months | QUANTITY | 0.99+ |
Rocana | ORGANIZATION | 0.99+ |
Rocana | PERSON | 0.99+ |
SAS | ORGANIZATION | 0.99+ |
both | QUANTITY | 0.99+ |
Tableau | TITLE | 0.99+ |
one | QUANTITY | 0.98+ |
five years | QUANTITY | 0.98+ |
this year | DATE | 0.98+ |
nine months ago | DATE | 0.97+ |
tons of data | QUANTITY | 0.96+ |
Stanford | ORGANIZATION | 0.92+ |
theCUBE | ORGANIZATION | 0.9+ |
Oracle Headquarters | LOCATION | 0.86+ |
millions of events per second | QUANTITY | 0.86+ |
Berkeley | LOCATION | 0.83+ |
one data set | QUANTITY | 0.81+ |
Apache Parquet | TITLE | 0.75+ |
Kafka | PERSON | 0.61+ |
tenets | QUANTITY | 0.6+ |
On the Ground | TITLE | 0.6+ |
On The Ground | TITLE | 0.54+ |
CUBE | EVENT | 0.22+ |
Jean-Pierre Dijcks, Oracle - On the Ground - #theCUBE
>> Narrator: The Cube presents, On the Ground. (techno music) >> Hi I'm Peter Burris, welcome to an On the Ground here at Oracle Headquarters, with Silicon Angle Media's The Cube. Today we're talking to JP Dijcks, who is the master product manager inside, or one of the master product managers, inside Oracle's big data product group, welcome JP. >> Thank you Peter. >> Well, we're going to talk about how developers get access to this plethora, this miasma, this unbelievable complexity of data that's being made possible by IOT, traditional applications, and other sources, how are developers going to get access to this data? >> That's a good question, Peter, I still think that one of the key aspects to getting access to that data is SQL, and so that's one of the ways we are driving, try to figure out, can we get the Oracle SQL engine, and all the richness of SQL analytics enabled on all of that data, no matter what the format is, or no matter where it lives, how can I enable those SQL analytics on that, and then obviously we've all seen the shift in APIs, and languages, like people don't necessarily always want to speak SQL and write SQL questions, or write SQL queries. So how do we then enable things like R, how do we enable Perl, how do we enable Python, all sorts of things like that, how do we do that, and so the thought we had was, can we use SQL as the common meta-data interface? And the common structure around some of this, and enable all of these languages on top of that through the database. So that's kind of the baseline of what we're thinking of, of enabling this to developers and large communities of users. So that's SQL as an access method, do you also envision that SQL will also be a data creation language? As we think about how to envision big data coming together from a modeling perspective. >> So I think from a modeling perspective the meta-data part we certainly look at as a creation or definition language is probably the better word, how do I do structured queries, 'cause that's what SQL stands for, how do I do that on JSON documents, how do I do that on IOT data as you said, how do I get that done, and so we certainly want to create the meta-data, in like a very traditional database catalog, or if you compare to a Hive Catalog, very much like that. The execution is very different, it uses the mechanisms under the cover that NoSQL databases have, or that Hadoop HDFS offer, and we certainly have no real interest in doing insert into Hadoop, 'cause the transaction mechanisms work very very differently, so it's really focused on the meta-data areas and how do I expose that, how do I classify and categorize that data in ways people know and have seen for years. >> So that data manipulation will be handled by native tools, and some of the creations, some of the generation, some of the modeling will be handled now inside SQL, and there are a lot of SQL folks out there that have pretty good affinity for how to work with data. >> That's absolutely correct. >> So that's what it is, now how does it work? Tell us a bit about how this big data SQL is going to work, in a practical world. >> Okay. So we talked about the modeling already. The first step is that we extend the Oracle database and the catalog to understand things like Hive objects or HDFS kind of, where does stuff live. So we expanded and so we found a way to classify the meta-data first and foremost.
The real magic is leveraging the Hadoop stack, so you ask a BI question and you want to join data in Oracle transactions, finance information, let's say with IOT data, which you'd reach out to HDFS for, big data SQL runs on the Hadoop nodes, so it's local processing of that data, and it works exactly as HDFS and Hadoop work, in other words, I'm going to do processing local, I'm going to ask the name node which blocks am I supposed to read, that'll get run, we generate that query, we put it down to the Hadoop nodes. And that's when some of the magic of SQL kicks in, which is really focused on performance, it's performance, performance, performance, that's always the problem with federated data, how do I get it to perform across the board. And so what we took was, >> Predictably. >> Predictably, that's an interesting one, predictable performance, 'cause sometimes it works, sometimes it doesn't. So what we did is we took the exadata storage software, with all the magic as to how do I get performance out of a file system, out of IO, and we put that on the Hadoop nodes, and then we push the queries all the way down to that software, and it does filtering, it does predicate pushdown, it leverages features like Parquet and ORC on the HDFS side, and at the end of the day, it kind of takes the IO requests, which is what a SQL query gives, feeds it to the Hadoop nodes, runs it locally, and then sends it back to the database. And so we filter out a lot of the gunk we don't need, 'cause you said, oh I only need yesterday's data, or whatever the predicates are, and so that's how we think we can get an architecture ready that allows the global optimization, 'cause we can see the entire ecosystem in its totality, IOT, Oracle, all of it combined, we optimize the queries, push everything down as far as we can, algorithms to data, not data to algorithms, and that's how we're going to run this performance, predictably performance, on all of these pieces of data. >> So we end up with, if I got this right, let me recap, so we've got this notion that for data creation, data modeling, we can now use SQL, understood by a lot of people, doesn't preclude us from using native tools, but at least that's one place where we can see how it all comes together, we continue to use local tools for the actual manipulation elements. >> Absolutely. >> We are now using synergy like structures so we can push the algorithm down to the data, so we're moving a small amount of data to a large amount of data, 'cause it brings cost down and improves predictability, but at the same time we've got meta-data objects that allow us to anticipate with some degree of predictability how this whole thing will run, and how this will come together back at the keynote, got that right? >> Got that right. >> Alright, so, next question is what's the impact of doing it this way? Talk a bit, if you can, about how it's helping folks who run data, who build applications, and who actually are trying to get business value out of this whole process. >> So if we start with the business value, I think the biggest thing we bring to the table is simplicity, and standardization. If I have to understand how is this object represented in NoSQL, how in HDFS, how did somebody put a JSON file in here, I have to now spend time on literally digging through that, and then does it conform, do I have to modify it, what do I do? So I think the business value comes out of the SQL layer on top of it. It all looks exactly the same.
It's well known, it's well understood, it's far quicker to get from, I've got a bunch of data, to actually building a BI report, building a dashboard, building KPIs, and integrating that data, there's nothing new to data, it's a level of abstraction we put on top of this, whether you use API or in this case we use SQL, 'cause that's the most common analytics language. So that's one part of how it will impact things. The second is, and I think that's where the architecture is completely unique, we keep complete control of the query execution, from the meta-data we just talked about, and that enables us to do global optimization, and we can, and if you think this through a little bit, and go, oh global optimization sounds really cool, what does that mean? I can now actually start pushing processing, I can move data, and it's what we've done in the exadata platform for years, data lives on disk, oh, Peter likes to query it very frequently, let's move it up to Flash, let's move it up to in-memory, let's twist the data around. So all of a sudden we got control, we understand what gets queried, we understand where data lives, and we can start to optimize, exactly for the usage pattern the customer has, and that's always the performance aspect. And that goes to the old saying of, how can I get data as quickly to a customer when he really needs it, that's what this does, right, how can I optimize this? I've got thousands of people querying certain elements, move them up in the stack and get the performance and all these queries come back in like seconds. Regulatory stuff that needs to go through like five years of data, let's put it in cheap areas, and let's optimize that, and so the impact is cheaper and faster at the end of the day, and all 'cause there's a singular entity almost that governs the data, it governs the queries, it governs the usage patterns, that's what we uniquely bring to the table with this architecture. >> So I want to build on the notion of governance, because actually one of the interesting things you said was the idea that if it's all under a common sort of interface, then you have greater visibility, where the data is, who owns it, et cetera. If you do this right, one of the biggest challenges that businesses are having is the global sense of how you govern your data. If you do this right, are you that much closer to having a competent overall data governance? >> I think we were able to set up a big step forward on it, and it sounds very simple, but we now have a central catalog, that actually understands what your data is and where it lives, in kind of like a well-known way, and again it sounds very simple but if you look at silos, that's the biggest problem, you have multiple silos, multiple things are in there, nobody knows really what's in there, so here we start to publish this in like a common structural layer, we have all the technical meta-data, we track who queries what, who does all those things, so that's a tremendous help in governance. The other side of course, because we still use native tools to let's say manipulate some data, or augment or add new data, we now are going to tie in a lot of the meta-data, that comes from say the Hadoop ecosystem, again into this catalog, and while we're probably not there yet just today on the end-to-end governance, everything's kind of out of the box, here we go.
>> And we probably never will, you're right, and I think we set a major step forward with just consolidating it, and exposing people to all the data the have, and you can run all the other tools like, crawl my data and check box anything that says SSN, or looks like a social security number, all of those tools are are still relevant. We just have a consolidated view, dramatically improved governance. >> So I'm going to throw you a curve ball. >> Sure. >> Not all data I want to use is inside my business, or is being generated by sensors that I control, how does big data SQL and related technologies play a role in the actual contracting for additional data sources, and sustaining those relationships that are very very fundamental, how data's shared across organizations. Do you see this information being brought in under this umbrella? Do you see Oracle facilitating those types of relationships, introducing standards for data sharing across partnerships becomes even easier? >> I'm not convinced that big data SQL as a technology is going to solve all the problems we see there, I'm absolutely convinced that Oracle is going to work towards that, you see it in so many acquisitions we've done, you see it in the efforts of making data as a service available to people, and to some extent big data SQL will be a foundation layer to make BI queries run smoother across more and more and more pillars of data. If we can integrate database, Hadoop, and NoSQL, there's nothing that says, oh and by the way, storage cloud. >> And we have relatively common physical governance, that I have the same physical governance, and you have the same physical governance, now its easier for us to show how we can introduce governance across our instances. >> Absolutely, and today we focus a lot on HDFS or Hadoop as the next data pillar, storage cloud, ground to cloud, all of those are on the roadmap for big data SQL to catch up with that, and so if you have data as a service, let's declare that cloud for a second, and I have data in my database in my Hadoop cluster, again, all now becomes part of the same ecosystem of data, and it all looks the same to me from a BI query perspective, from an analytics perspective. And then the, how do I get the data sharing standards set up and all that, part of that is driving a lot of it into cloud, and making it all as a service, 'cause again you put a level of abstraction on top of it, that makes it easier to consume, understand where it came from, and capture the meta-data. >> So JP one last question. >> Sure. >> Oracle opens worlds on the horizon, what are you looking for, or what will your customers be looking for as it pertains to this big data SQL and related technologies? >> I think specifically from a big data SQL perspective, is we're going to drive the possible adoption scope much much further, today we work with HDFS an we work with Oracle database, we're going to announce certain things like exadata, Hadoop will be supportive, we hold down super cluster support, we're going to dramatically expand the footprint big data SQL will run on, people who come for big data SQL or analytics sessions you'll see a lot of the roadmap looking far more forward. 
I already mentioned some things like ground to cloud, how can I run big data SQL when my exadata is on premises, and then the rest of my HDFS data is in the cloud, we're going to be talking about how we're going to do that, and what do we think the evolution of big data SQL is going to be, I think that's going to be a very fun session to go to. >> JP Dijcks, a master product manager inside the Oracle big data product group, thank you very much for joining us here On the Ground, at Oracle headquarters, this is The Cube.
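[A hedged sketch of the pattern JP describes: a Hive-backed external table defined in the Oracle catalog and then joined to an ordinary Oracle table with plain SQL, so filters can be evaluated out on the Hadoop side. The table names, columns, and connection details are hypothetical, and the exact external-table DDL varies by Big Data SQL release, so treat this as an illustration rather than reference syntax.]

```python
import oracledb  # Oracle's Python driver; the client language is incidental here

ddl = """
CREATE TABLE iot_readings_ext (
    device_id   VARCHAR2(64),
    reading_ts  TIMESTAMP,
    temperature NUMBER
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_HIVE                       -- Big Data SQL access driver over a Hive table
    DEFAULT DIRECTORY default_dir
    ACCESS PARAMETERS (com.oracle.bigdata.tablename: iot.readings)
)
REJECT LIMIT UNLIMITED
"""

query = """
SELECT a.account_name, COUNT(*) AS hot_readings
FROM   accounts a                          -- ordinary Oracle table
JOIN   iot_readings_ext r                  -- external table backed by Hive/HDFS data
  ON   r.device_id = a.device_id
WHERE  r.reading_ts >= TRUNC(SYSDATE) - 1  -- predicates like these get pushed toward the data
  AND  r.temperature > 80
GROUP  BY a.account_name
"""

with oracledb.connect(user="demo", password="demo", dsn="demo_dsn") as conn:
    cur = conn.cursor()
    cur.execute(ddl)          # one-time definition; the data itself stays in Hadoop
    for row in cur.execute(query):
        print(row)
```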
SUMMARY :
Narrator: The Cube presents, On the Ground. or one of the master product managers, and so that's one of the ways we are driving, and so we certainly want to create the meta-data, and some of the creations, some of the generation, So that's what it is, now how does it work? and the catalog to understand things like Hive objects and so that's how we think we can get an architecture ready So we end up with, if I got this right, let me recap, and who actually who are trying to get business value out of and we can, and if you think this through a little bit, because actually one of the interesting things you said everything's kind of out of the box, here we go. and I think we set a major step forward and sustaining those relationships that are and to some extent big data SQL will be a foundation and you have the same physical governance, Absolutely, and today we focus a lot on HDFS or Hadoop and what do we think the evolution the Oracle big data product group,
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Peter Burris | PERSON | 0.99+ |
Peter | PERSON | 0.99+ |
JP Dijcks | PERSON | 0.99+ |
Jean-Pierre Dijcks | PERSON | 0.99+ |
Oracle | ORGANIZATION | 0.99+ |
JP | PERSON | 0.99+ |
five years | QUANTITY | 0.99+ |
Jason | PERSON | 0.99+ |
Python | TITLE | 0.99+ |
SQL | TITLE | 0.99+ |
NoSQL | TITLE | 0.99+ |
2nd | QUANTITY | 0.98+ |
first step | QUANTITY | 0.98+ |
today | DATE | 0.98+ |
HDFS | ORGANIZATION | 0.98+ |
Today | DATE | 0.98+ |
one | QUANTITY | 0.97+ |
Hadoop | TITLE | 0.96+ |
Parquet | TITLE | 0.96+ |
one part | QUANTITY | 0.95+ |
The Cube | ORGANIZATION | 0.95+ |
thousands of people | QUANTITY | 0.94+ |
yesterdays | DATE | 0.94+ |
ORC | TITLE | 0.93+ |
Silicon Angle | ORGANIZATION | 0.92+ |
Flash | TITLE | 0.91+ |
first | QUANTITY | 0.84+ |
years | QUANTITY | 0.83+ |
a second | QUANTITY | 0.82+ |
the Ground | TITLE | 0.82+ |
Hadoop HDFS | TITLE | 0.81+ |
a lot of people | QUANTITY | 0.8+ |
one place | QUANTITY | 0.77+ |
The Cube | TITLE | 0.6+ |
singular | QUANTITY | 0.58+ |
Narrator | TITLE | 0.54+ |
last question | QUANTITY | 0.52+ |
IOT | ORGANIZATION | 0.37+ |