Anais Dotis Georgiou, InfluxData
(upbeat music) >> Okay, we're back. I'm Dave Vellante with The Cube and you're watching Evolving InfluxDB into the Smart Data Platform, made possible by Influx Data. Anais Dotis-Georgiou is here. She's a developer advocate for Influx Data, and we're going to dig into the rationale and value contribution behind several open source technologies that InfluxDB is leveraging to increase the granularity of time series analysis and bring the world of data into realtime analytics. Anais, welcome to the program. Thanks for coming on. >> Hi, thank you so much. It's a pleasure to be here. >> Oh, you're very welcome. Okay, so IOx is being touted as this next gen open source core for InfluxDB. And my understanding is that it leverages in-memory processing, of course, for speed. It's a columnar store, so it gives you compression efficiency, it's going to give you faster query speeds, and it's going to store files in object storage, so you've got a very cost effective approach. Are these the salient points on the platform? I know there are probably dozens of other features, but what are the high level value points that people should understand? >> Sure, that's a great question. So some of the main requirements that IOx is trying to achieve, and some of the most impressive ones to me: the first one is that it aims to have no limits on cardinality and also allow you to write any kind of event data that you want, whether that's a tag or a field. It also wants to deliver best in class performance on analytics queries, in addition to our already well served metric queries. We also want to have operator control over memory usage, so you should be able to define how much memory is used for buffering, caching, and query processing. Another really important part is the ability to have bulk data export and import, which is super useful. Also, broader ecosystem compatibility: where possible, we aim to use and embrace emerging standards in the data analytics ecosystem and have compatibility with things like SQL, Python, and maybe even Pandas in the future. >> Okay, so a lot there. Now we talked to Brian about how you're using Rust, which is not a new programming language, and of course we had some drama around Rust during the pandemic with the Mozilla layoffs, but the formation of the Rust Foundation really addressed any of those concerns, and you've got big guns like Amazon and Google and Microsoft throwing their collective weight behind it. Adoption is really starting to get steep on the S-curve. So lots of platforms, lots of adoption with Rust, but why Rust as an alternative to, say, C++ for example? >> Sure, that's a great question. So Rust was chosen because of its exceptional performance and reliability. So while Rust is syntactically similar to C++ and has similar performance, it also compiles to native code like C++. But unlike C++, it also has much better memory safety. Memory safety is protection against bugs or security vulnerabilities that lead to excessive memory usage or memory leaks. And Rust achieves this memory safety due to its innovative type system. Additionally, it doesn't allow for dangling pointers, and dangling pointers are the main classes of errors that lead to exploitable security vulnerabilities in languages like C++.
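As a rough illustration of the memory safety Anais is describing, here is a minimal standalone Rust sketch, illustrative only and not InfluxDB or IOx code: the compiler rejects the kind of dangling reference that C++ will happily compile, and the safe alternative simply hands ownership of the value back to the caller.

```rust
// Standalone sketch of Rust's ownership rules; not InfluxDB code.
// The commented-out function would not compile: it tries to return a
// reference to a local value that is dropped when the function ends,
// which is exactly a dangling pointer.
//
// fn dangling() -> &String {
//     let s = String::from("room temperature: 21.0");
//     &s // rejected by the compiler: `s` is dropped here, so the
//        // reference handed back to the caller would dangle
// }

// The ownership-respecting version moves the value out to the caller,
// so nothing is left behind to point at.
fn owned() -> String {
    String::from("room temperature: 21.0")
}

fn main() {
    let reading = owned();
    println!("{}", reading);
}
```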
So Rust helps meet that requirement of having no limits on cardinality, for example, because we're also using the Rust implementation of Apache Arrow and this control over memory, and also Rust's packaging system, crates.io, offers everything that you need out of the box to have features like async and await to fix race conditions, to protect against buffer overflows, and to ensure thread safe async caching structures as well. So essentially it just has all the control, all the fine grain control you need, to take advantage of memory and all your resources as well as possible so that you can handle those really, really high cardinality use cases. >> Yeah, and the more I learn about the new engine and the platform, IOx, et cetera, you see things like, in the old days and even today, you do a lot of garbage collection in these systems, and there's an inverse impact relative to performance. So it looks like the community is really modernizing the platform, but I want to talk about Apache Arrow for a moment. It's designed to address the constraints that are associated with analyzing large data sets. We know that, but please explain why. What is Arrow and what does it bring to InfluxDB? >> Sure. Yeah. So Arrow is a framework for defining in-memory columnar data, and so much of the efficiency and performance of IOx comes from taking advantage of columnar data structures. And I will, if you don't mind, take a moment to illustrate why columnar data structures are so valuable. Let's pretend that we are gathering field data about the temperature in our room and also maybe the temperature of our stove. And in our table we have those two temperature values as well as maybe a measurement value, a timestamp value, and maybe some other tag values that describe what room and what house, et cetera, we're getting this data from. And so you can picture this table where we have two rows with the two temperature values for both our room and the stove. Well, usually our room temperature is regulated, so those values don't change very often. When you have column oriented storage, essentially you take each column and group it together. And so if that's the case and you're just taking temperature values from the room, and a lot of those temperature values are the same, then you might be able to imagine how equal values will then neighbor each other, and when they neighbor each other in the storage format, this provides a really perfect opportunity for cheap compression. And then this cheap compression enables high cardinality use cases. It also enables faster scan rates. So if you want to find the min and max value of the temperature in the room across a thousand different points, you only have to get those thousand points in order to answer that question, and you have those immediately available to you. But let's contrast this with a row oriented storage solution instead, so that we can understand better the benefits of column oriented storage. If you had row oriented storage, you'd first have to look at every field, like the temperature in the room and the temperature of the stove. You'd have to go across every tag value that maybe describes where the room is located or what model the stove is. And at every timestamp, you'd then have to pluck out that one temperature value that you want at that one timestamp, and do that for every single row.
So you're scanning across a ton more data, and that's why row oriented doesn't provide the same efficiency as columnar, and Apache Arrow is an in-memory columnar data framework. So that's where a lot of the advantages come from. >> Okay. So you've basically described like a traditional database, a row approach, but I've seen a lot of traditional databases say, okay, now we can handle column format. Versus what you're talking about, which is really kind of native, is the former not as effective because it's largely a bolt-on? Can you elucidate on that front? >> Yeah, it's not as effective because you have more expensive compression and because you can't scan across the values as quickly. And so those are pretty much the main reasons why row oriented storage isn't as efficient as column oriented storage. >> Yeah. Got it. So let's talk about Arrow DataFusion. What is DataFusion? I know it's written in Rust, but what does it bring to the table here? >> Sure. So it's an extensible query execution framework, and it uses Arrow as its in-memory format. The way that it helps InfluxDB IOx is that, okay, it's great if you can write an unlimited amount of cardinality into InfluxDB, but if you don't have a query engine that can successfully query that data, then I don't know how much value it is for you. So DataFusion helps enable the querying, processing, and transformation of that data. It also has a Pandas API so that you could take advantage of Pandas data frames as well, and all of the machine learning tools associated with Pandas. >> Okay. You're also leveraging Parquet in the platform, of course. We heard a lot about Parquet in the middle of the last decade as a storage format to improve on Hadoop column stores. What are you doing with Parquet and why is it important? >> Sure. So Parquet is the column oriented durable file format. It's important because it'll enable bulk import and bulk export. It has compatibility with Python and Pandas, so it supports a broader ecosystem. Parquet files also take very little disk space and they're faster to scan because, again, they're column oriented. In particular, I think Parquet files are like 16 times cheaper than CSV files, just as kind of a point of reference. And so that's essentially a lot of the benefits of Parquet. >> Got it. Very popular. So on these, what exactly is Influx Data focusing on as a committer to these projects? What is your focus? What's the value that you're bringing to the community? >> Sure. So Influx Data first has contributed a lot of different things to the Apache ecosystem. For example, they contributed an implementation of Apache Arrow in Go, and that will support querying with Flux. Also, there have been quite a few contributions to DataFusion for things like memory optimization and support for additional SQL features, like support for timestamp arithmetic, support for EXISTS clauses, and support for memory control. So yeah, Influx has contributed a lot to the Apache ecosystem and continues to do so. And I think kind of the idea here is that if you can improve these upstream projects, then the long term strategy is that the more you contribute and build those up, the more you will perpetuate that cycle of improvement, and the more we will invest in our own project as well. So it's just that kind of symbiotic relationship and appreciation of the open source community. >> Yeah. Got it. You got that virtuous cycle going, people call it the flywheel.
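Pulling together the pieces Anais just walked through, here is a minimal sketch of how the open source Arrow, Parquet, and DataFusion crates fit together: build an in-memory columnar RecordBatch with the room and stove temperature example, persist it as a Parquet file, and run a SQL min/max query over it with DataFusion. This is an illustrative use of the public crates only, not IOx internals, and it assumes matching versions of the arrow, parquet, datafusion, and tokio crates; exact module paths and option names vary a little between releases.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{Float64Array, StringArray, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use parquet::arrow::ArrowWriter;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One column per field or tag: a columnar layout keeps equal,
    // slowly changing values (like a regulated room temperature)
    // next to each other, which is what makes compression cheap.
    let schema = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("location", DataType::Utf8, false),
        Field::new("temperature", DataType::Float64, false),
    ]));

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(TimestampNanosecondArray::from(vec![1_000_i64, 2_000, 3_000, 4_000])),
            Arc::new(StringArray::from(vec!["room", "room", "stove", "stove"])),
            Arc::new(Float64Array::from(vec![21.0, 21.0, 180.5, 181.0])),
        ],
    )?;

    // Persist the batch as a durable, column-oriented Parquet file.
    let file = File::create("temps.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Query it back through DataFusion's SQL front end.
    let ctx = SessionContext::new();
    ctx.register_parquet("temps", "temps.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx
        .sql("SELECT location, MIN(temperature) AS min_temp, MAX(temperature) AS max_temp \
              FROM temps GROUP BY location")
        .await?;
    df.show().await?;

    Ok(())
}
```

Because the room's temperature column repeats the same value, Parquet's column encodings compress it cheaply, and the min/max query only has to touch the one column it needs, which is the row versus column trade-off described above.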
Give us your last thoughts and kind of summarize what the big takeaways are from your perspective. >> So I think the big takeaway is that Influx Data is doing a lot of really exciting things with InfluxDB IOx, and if you are interested in learning more about the technologies that Influx is leveraging to produce IOx, the challenges associated with it, and all of the hard work, and you just want to learn more, then I would encourage you to go to the monthly tech talks and community office hours; they are on every second Wednesday of the month at 8:30 AM Pacific time. There are also community forums and a community Slack channel. Look for the InfluxDB_IOx channel specifically to learn more about how to join those office hours and those monthly tech talks, as well as to ask any questions you have about IOx, what to expect, and what you'd like to learn more about. As a developer advocate, I want to answer your questions. So if there's a particular technology or stack that you want to dive deeper into and want more explanation about how InfluxDB leverages it to build IOx, I will be really excited to produce content on that topic for you. >> Yeah, that's awesome. You guys have a really rich community, collaborate with your peers, solve problems, and you guys are super responsive, so really appreciate that. All right, thank you so much Anais for explaining all this open source stuff to the audience and why it's important to the future of data. >> Thank you. I really appreciate it. >> All right, you're very welcome. Okay, stay right there and in a moment I'll be back with Tim Yocum. He's the director of engineering for Influx Data, and we're going to talk about how you update a SaaS engine while the plane is flying at 30,000 feet. You don't want to miss this. (upbeat music)
SUMMARY :
Anais Dotis-Georgiou explains the open source foundations of InfluxDB IOx: why Rust was chosen for its performance and memory safety, how Apache Arrow's columnar in-memory format enables cheap compression and fast scans for high cardinality data, how DataFusion provides the query engine and a Pandas-friendly API, how Parquet serves as the durable column-oriented file format, and how Influx Data contributes back to those upstream Apache projects. She closes by pointing viewers to the monthly tech talks, community office hours, forums, and Slack.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Tim Yocum | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Microsoft | ORGANIZATION | 0.99+ |
ORGANIZATION | 0.99+ | |
Anais | PERSON | 0.99+ |
two rows | QUANTITY | 0.99+ |
16 times | QUANTITY | 0.99+ |
Influx Data | ORGANIZATION | 0.99+ |
each row | QUANTITY | 0.99+ |
Python | TITLE | 0.99+ |
Rust | TITLE | 0.99+ |
C++ | TITLE | 0.99+ |
SQL | TITLE | 0.99+ |
Anais Dotis Georgiou | PERSON | 0.99+ |
InfluxDB | TITLE | 0.99+ |
both | QUANTITY | 0.99+ |
Rust Foundation | ORGANIZATION | 0.99+ |
30,000 feet | QUANTITY | 0.99+ |
first one | QUANTITY | 0.99+ |
Mozilla | ORGANIZATION | 0.99+ |
Pandas | TITLE | 0.98+ |
InfluxData | ORGANIZATION | 0.98+ |
Influx | ORGANIZATION | 0.98+ |
IOx | TITLE | 0.98+ |
each column | QUANTITY | 0.97+ |
one time stamp | QUANTITY | 0.97+ |
first | QUANTITY | 0.97+ |
Influx | TITLE | 0.96+ |
Anais Dotis-Georgiou | PERSON | 0.95+ |
Crates IO | TITLE | 0.94+ |
IOx | ORGANIZATION | 0.94+ |
two temperature values | QUANTITY | 0.93+ |
Apache | ORGANIZATION | 0.93+ |
today | DATE | 0.93+ |
8:30 AM Pacific time | DATE | 0.92+ |
Wednesday | DATE | 0.91+ |
one temperature | QUANTITY | 0.91+ |
two temperature values | QUANTITY | 0.91+ |
InfluxDB IOx | TITLE | 0.9+ |
influx | ORGANIZATION | 0.89+ |
last decade | DATE | 0.88+ |
single row | QUANTITY | 0.83+ |
a ton more data | QUANTITY | 0.81+ |
thousand | QUANTITY | 0.8+ |
dozens of other features | QUANTITY | 0.8+ |
a thousand different points | QUANTITY | 0.79+ |
Hadoop | TITLE | 0.77+ |
Parquet | TITLE | 0.76+ |
points | QUANTITY | 0.75+ |
each | QUANTITY | 0.75+ |
Slack | TITLE | 0.74+ |
Evolving InfluxDB | TITLE | 0.68+ |
Arrow | TITLE | 0.62+ |
The Cube | ORGANIZATION | 0.61+ |
Evolving InfluxDB into the Smart Data Platform Open
>> This past May, The Cube, in collaboration with Influx Data, shared with you the latest innovations in time series databases. We talked at length about why a purpose-built time series database was, for many use cases, a superior alternative to general purpose databases trying to do the same thing. Now, you may remember that time series data is any data that's stamped in time, and if it's stamped, it can be analyzed historically. And when we introduced the concept to the community, we talked about how in theory those time slices could be taken, you know, every hour, every minute, every second, down to the millisecond, and how the world was moving toward realtime or near realtime data analysis to support physical infrastructure like sensors and other devices and IoT equipment. Time series databases have had to evolve to efficiently support realtime data in emerging use cases in IoT and elsewhere. And to do that, new architectural innovations have to be brought to bear. As is often the case, open source software is the linchpin to those innovations. Hello and welcome to Evolving InfluxDB into the Smart Data Platform, made possible by Influx Data and produced by The Cube. My name is Dave Vellante, and I'll be your host today. Now, in this program, we're going to dig pretty deep into what's happening with time series data generally, and specifically how InfluxDB is evolving to support new workloads and demands and data, specifically around data analytics use cases in real time. Now, first we're going to hear from Brian Gilmore, who is the director of IoT and emerging technologies at Influx Data. And we're going to talk about the continued evolution of InfluxDB and the new capabilities enabled by open source generally and specific tools. And in this program, you're going to hear a lot about things like the Rust implementation of Apache Arrow, the use of Parquet, and tooling such as DataFusion, which are powering a new engine for InfluxDB. Now, these innovations evolve the idea of time series analysis by dramatically increasing the granularity of time series data, by compressing the historical time slices, if you will, from, for example, minutes down to milliseconds, and at the same time enabling real time analytics with an architecture that can process data much faster and much more efficiently. Now, after Brian, we're going to hear from Anais Dotis-Georgiou, who is a developer advocate at Influx Data. And we're going to get into the "why's" of these open source capabilities, and how they contribute to the evolution of the InfluxDB platform. And then we're going to close the program with Tim Yocum. He's the director of engineering at Influx Data, and he's going to explain how the InfluxDB community actually evolved the data engine in mid-flight and which decisions went into the innovations that are coming to market. Thank you for being here. We hope you enjoy the program. Let's get started.
SUMMARY :
Dave Vellante introduces the program: a look at how InfluxDB is evolving into a smart data platform, with segments from Brian Gilmore on the new engine, Anais Dotis-Georgiou on its open source components (Rust, Apache Arrow, Parquet, and DataFusion), and Tim Yocum on how the engine was evolved in mid-flight.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Brian Gilmore | PERSON | 0.99+ |
Dave Vellante | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Tim Yocum | PERSON | 0.99+ |
Influx Data | ORGANIZATION | 0.99+ |
Anais Dotis-Georgiou | PERSON | 0.99+ |
Influx DB | TITLE | 0.99+ |
InfluxDB | TITLE | 0.94+ |
first | QUANTITY | 0.91+ |
today | DATE | 0.88+ |
second | QUANTITY | 0.85+ |
Time | TITLE | 0.82+ |
Parquet | TITLE | 0.76+ |
Apache | ORGANIZATION | 0.75+ |
past May | DATE | 0.75+ |
Influx | TITLE | 0.75+ |
IOT | ORGANIZATION | 0.69+ |
Cube | ORGANIZATION | 0.65+ |
influx | ORGANIZATION | 0.53+ |
Arrow | TITLE | 0.48+ |
Brian Gilmore, InfluxData
(soft upbeat music) >> Okay, we're kicking things off with Brian Gilmore. He's the director of IoT and emerging technology at InfluxData. Brian, welcome to the program. Thanks for coming on. >> Thanks, Dave, great to be here. I appreciate the time. >> Hey, explain why InfluxDB, you know, needs a new engine. Was there something wrong with the current engine? What's going on there? >> No, no, not at all. I mean, I think, for us it's been about staying ahead of the market. I think, you know, if we think about what our customers are coming to us sort of with now, you know, related to requests like SQL query support, things like that, we have to figure out a way to execute those for them in a way that will scale long term. And then we also want to make sure we're innovating, we're sort of staying ahead of the market as well, and sort of anticipating those future needs. So, you know, this is really a transparent change for our customers. I mean, I think we'll be adding new capabilities over time that sort of leverage this new engine. But, you know, initially, the customers who are using us are going to see just great improvements in performance, you know, especially those that are working at the top end of the workload scale, you know, the massive data volumes and things like that. >> Yeah, and we're going to get into that today and the architecture and the like. But what was the catalyst for the enhancements? I mean, when and how did this all come about? >> Well, I mean, like three years ago, we were primarily on premises, right? I mean, I think we had our open source, we had an enterprise product. And sort of shifting that technology, especially the open source code base, to a service basis where we were hosting it through, you know, multiple cloud providers, that was a long journey. (chuckles) I guess, you know, phase one was, we wanted to host enterprise for our customers, so we sort of created a service that we just managed and ran our enterprise product for them. You know, phase two of this cloud effort was to optimize for like multi-tenant, multi-cloud, be able to host it in a truly like SaaS manner where we could use, you know, some type of customer activity or consumption as the pricing vector. And that was sort of the birth of the real first InfluxDB Cloud, you know, which has been really successful. We've seen, I think, like 60,000 people sign up. And we've got tons and tons of both enterprises as well as like new companies, developers, and of course a lot of home hobbyists and enthusiasts who are using us on a daily basis. And having that sort of big pool of very diverse and varied customers to chat with as they're using the product, as they're giving us feedback, et cetera, has, you know, pointed us in a really good direction in terms of making sure we're continuously improving that, and then also making these big leaps as we're doing with this new engine. >> All right, so you've called it a transparent change for customers, so I'm presuming it's non-disruptive, but I really want to understand how much of a pivot this is, and what does it take to make that shift from, you know, time series specialist to real time analytics and being able to support both? >> Yeah, I mean, it's much more of an evolution, I think, than like a shift or a pivot. Time series data is always going to be fundamental in sort of the basis of the solutions that we offer our customers, and then also the ones that they're building on the sort of raw APIs of our platform themselves.
The time series market is one that we've worked diligently to lead, I mean, I think, when it comes to metrics, especially like sensor data and app and infrastructure metrics. If we're being honest though, I think our user base is well aware that the way we were architected was much more towards those sort of like backwards-looking historical type analytics, which are key for troubleshooting and making sure you don't, you know, run into the same problem twice. But, you know, we had to ask ourselves like, what can we do to better handle those queries from a performance and time-to-response standpoint, and can we get that to the point where the result sets are coming back so quickly from the time of query that we can limit that window down to minutes and then seconds? And now with this new engine, we're really starting to talk about a query window that could be like returning results in, you know, milliseconds of time since it hit the ingest queue. And that's really getting to the point where, as your data is available, you can use it and you can query it, you can visualize it, you can do all those sort of magical things with it. And I think getting all of that to a place where we're saying like, yes to the customer on, you know, all of the real time queries, the multiple language query support. But, you know, it was hard, but we're now at a spot where we can start introducing that to, you know, a limited number of customers, strategic customers and strategic availability zones to start, but, you know, everybody over time. >> So you're basically going from what happened to, and you can still do that, obviously, but to what's happening now in the moment? >> Yeah. Yeah. I mean, if you think about time, it's always sort of past, right? I mean, like in the moment right now, whether you're talking about like a millisecond ago or a minute ago, you know, that's pretty much right now, I think for most people, especially in these use cases where you have other sort of components of latency induced by the underlying data collection, the architecture, the infrastructure, the devices, and you know, the sort of highly distributed nature of all of this. So, yeah, I mean, getting a customer or a user to be able to use the data as soon as it is available, is what we're after here. >> I always thought of real time as before you lose the customer, but now in this context, maybe it's before the machine blows up.
You know, one of the things we had to do here was like we doubled down on sort of our commitment to open source and availability. So, like, anybody today can take a look at the libraries on our GitHub and can inspect it and even can try to implement or execute some of it themselves in their own infrastructure. We are committed to bringing our sort of latest and greatest to our cloud customers first for a couple of reasons. Number one, you know, there are big workloads and they have high expectations of us. I think number two, it also gives us the opportunity to monitor a little bit more closely how it's working, how they're using it, like how the system itself is performing. And so just, you know, being careful, maybe a little cautious in terms of how big we go with this right away. That sort of both limits, you know, the risk of any issues that can come with new software rollouts, and we haven't seen anything so far. But also it does give us the opportunity to have like meaningful conversations with a small group of users who are using the products. But once we get through that and they give us two thumbs up on it, it'll be like, open the gates and let everybody in. It's going to be an exciting time for the whole ecosystem. >> Yeah, that makes a lot of sense. And you can do some experimentation and, you know, using the cloud resources. Let's dig into some of the architectural and technical innovations that are going to help deliver on this vision. What should we know there? >> Well, I mean, I think, foundationally, we built the new core on Rust. This is a new, very sort of popular systems language. It's extremely efficient, but it's also built for speed and memory safety, which goes back to us being able to deliver it in a way that is, you know, something we can inspect very closely, but then also rely on the fact that it's going to behave well even if it does hit error conditions. I mean, we've loved working with Go, and a lot of our libraries will continue to be sort of implemented in Go, but when it came to this particular new engine, that power, performance, and stability of Rust was critical. On top of that, like, we've also integrated Apache Arrow and Apache Parquet for persistence. I think, for anybody who's really familiar with the nuts and bolts of our backend and our TSI and our time series merge trees, this is a big break from that. You know, Arrow on the sort of in-memory side and then Parquet on the on-disk side. It allows us to present, you know, a unified set of APIs for those really fast real time queries that we talked about, as well as for very large, you know, historical sort of bulk data archives in that Parquet format, which is also cool because there's an entire ecosystem sort of popping up around Parquet in terms of the machine learning community. And getting that all to work, we had to glue it together with Arrow Flight. That's sort of what we're using as our RPC component. It handles the orchestration and the transportation of the columnar data, now that we're moving to like a true columnar database model for this version of the engine. You know, and it removes a lot of overhead for us in terms of having to manage all that serialization and deserialization, and, you know, to that again, like, blurring that line between real time and historical data, it's highly optimized for both streaming micro batch and then batches, but true streaming as well. >> Yeah, again, I mean, it's funny. You mentioned Rust.
It's been around for a long time but its popularity is, you know, really starting to hit that steep part of the S-curve. And we're going to dig into more of that, but give us, is there anything else that we should know about, Brian? Give us the last word. >> Well, I mean, I think first, I'd like everybody sort of watching, just to like, take a look at what we're offering in terms of early access and beta programs. I mean, if you want to participate or if you want to work sort of in terms of early access with the new engine, please reach out to the team. I'm sure, you know, there's a lot of communications going out and it'll be highly featured on our website. But reach out to the team. Believe it or not, like we have a lot more going on than just the new engine. And so there are also other programs, things we're offering to customers in terms of the user interface, data collection and things like that. And, you know, if you're a customer of ours and you have a sales team, a commercial team that you work with, you can reach out to them and see what you can get access to, because we can flip a lot of stuff on, especially in cloud through feature flags. But if there's something new that you want to try out, we'd just love to hear from you. And then, you know, our goal would be that as we give you access to all of these new cool features that, you know, you would give us continuous feedback on these products and services, not only like what you need today, but then what you'll need tomorrow to sort of build the next versions of your business. Because, you know, the whole database, the ecosystem as it expands out into this vertically-oriented stack of cloud services, and enterprise databases, and edge databases, you know, it's going to be what we all make it together, not just those of us who are employed by InfluxData. And then finally, I would just say, please, like, watch Anais' and Tim's sessions. Like, these are two of our best and brightest. They're totally brilliant, completely pragmatic, and they are most of all customer-obsessed, which is amazing. And there's no better takes, like honestly, on the sort of technical details of this than theirs, especially when it comes to the value that these investments will bring to our customers and our communities. So, I encourage you to, you know, pay more attention to them than you did to me, for sure. >> Brian Gilmore, great stuff. Really appreciate your time. Thank you. >> Yeah, thanks David, it was awesome. Looking forward to it. >> Yeah, me too. I'm looking forward to seeing how the community actually applies these new innovations and goes beyond just the historical into the real time. Really hot area. As Brian said, in a moment, I'll be right back with Anais Dotis-Georgiou to dig into the critical aspects of key open source components of the InfluxDB engine, including Rust, Arrow, Parquet, and DataFusion. Keep it right there. You don't want to miss this. (soft upbeat music)
SUMMARY :
Brian Gilmore explains why InfluxDB needed a new engine: staying ahead of customer demands like SQL query support and very high cardinality workloads, the evolution from on-premises products to the multi-tenant InfluxDB Cloud, and the goal of shrinking the query window toward milliseconds so data can be used as soon as it lands. He describes the technical foundations (Rust, Apache Arrow, Parquet, and Arrow Flight), notes that the rollout starts with cloud customers in strategic availability zones, and invites viewers into the early access and beta programs.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
David | PERSON | 0.99+ |
Brian Gilmore | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
Brian | PERSON | 0.99+ |
Tim | PERSON | 0.99+ |
60,000 people | QUANTITY | 0.99+ |
InfluxData | ORGANIZATION | 0.99+ |
two | QUANTITY | 0.99+ |
today | DATE | 0.99+ |
three years ago | DATE | 0.99+ |
twice | QUANTITY | 0.99+ |
Parquet | TITLE | 0.99+ |
both | QUANTITY | 0.98+ |
Anais' | PERSON | 0.98+ |
first | QUANTITY | 0.98+ |
tomorrow | DATE | 0.98+ |
Rust | TITLE | 0.98+ |
one | QUANTITY | 0.98+ |
a minute ago | DATE | 0.95+ |
two thumbs | QUANTITY | 0.95+ |
Arrow | TITLE | 0.94+ |
Anais Dotis-Georgiou | PERSON | 0.92+ |
tons | QUANTITY | 0.9+ |
InfluxDB | TITLE | 0.85+ |
Bri | PERSON | 0.82+ |
Apache | ORGANIZATION | 0.82+ |
InfluxDB | ORGANIZATION | 0.8+ |
GitHub | ORGANIZATION | 0.78+ |
phase one | QUANTITY | 0.73+ |
both enterprises | QUANTITY | 0.69+ |
SAS | ORGANIZATION | 0.68+ |
phase two | QUANTITY | 0.67+ |
Go | TITLE | 0.65+ |
Gilmore | PERSON | 0.63+ |
millisecond ago | DATE | 0.62+ |
Arrow | ORGANIZATION | 0.59+ |
Flight | ORGANIZATION | 0.52+ |
Data Fusion | TITLE | 0.46+ |
Go | ORGANIZATION | 0.41+ |