Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives


 

>> Sue: Hello everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this webinar. Joining me is Tom Wall, a member of the Vertica engineering team. But before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q and A session at the end of the presentation. We'll answer as many questions as we're able to during that time. Any questions that we don't get to, we'll do our best to answer offline. Alternatively, you can visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slides. And yes, this virtual session is being recorded and will be available to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Tom, over to you. >> Tom: Hello everyone and thanks for joining us today for this talk. My name is Tom Wall and I am the leader of Vertica's ecosystem engineering team. We are the team that focuses on building out all the developer tools and third party integrations that enable the software ecosystem that surrounds Vertica to thrive. So today, we'll be talking about some of our new open source initiatives and how those can be really effective for you and make things easier for you to build and integrate Vertica with the rest of your technology stack. We've got several new libraries, integration projects and examples, all open source, to share, all being built out in the open on our GitHub page. Whether you use these open source projects or not, this is a very exciting new effort that will really help to grow the developer community and enable lots of exciting new use cases. So, every developer out there has probably had to deal with a problem like this. You have some business requirements, to maybe build some new Vertica-powered application. Maybe you have to build some new system to visualize some data that's managed by Vertica. In various circumstances, lots of choices might be made for you that constrain your approach to solving a particular problem. These requirements can come from all different places. Maybe your solution has to work with a specific visualization tool, or web framework, because the business has already invested in the licensing and the tooling to use it. Maybe it has to be implemented in a specific programming language, since that's what all the developers on the team know how to write code with. While Vertica has many different integrations with lots of different programming languages and systems, there are a lot of them out there, and we don't have integrations for all of them. So how do you make ends meet when you don't have all the tools you need? You have to get creative, using tools like PyODBC, for example, to bridge between programming languages and frameworks to solve the problems you need to solve. Most languages do have an ODBC-based database interface. ODBC is a C library, and most programming languages know how to call C code, somehow.
So that's doable, but it often requires lots of configuration and troubleshooting to make all those moving parts work well together. So that's enough to get the job done, but native integrations are usually a lot smoother and easier. So rather than, for example, fighting with PyODBC in Python to configure things, get Unicode working, and compile all the different pieces the right way, it's better to make it all work smoothly. It would be much better if you could just pip install a library and get to work. And with Vertica-Python, a new Python client library, you can actually do that. So that story probably sounds pretty familiar to a lot of the audience here, because we're all using Vertica. And our challenge, as Big Data practitioners, is to make sense of all this stuff, despite those technical and non-technical hurdles. Vertica powers lots of different businesses and use cases across all kinds of different industries and verticals. While there's a lot that's different about us, we're all here together right now for this talk because we do have some things in common. We're all using Vertica, and we're probably also using Vertica with other systems and tools too, because it's important to use the right tool for the right job. That's a founding principle of Vertica and it's true today too. In this constantly changing technology landscape, we need lots of good tools and well established patterns, approaches, and advice on how to combine them so that we can be successful doing our jobs. Luckily for us, Vertica has been designed to be easy to build with and extend in this fashion. Databases as a whole have had this goal from the very beginning. They solve the hard problems of managing data so that you don't have to worry about it. Instead of worrying about those hard problems, you can focus on what matters most to you and your domain. So implementing that business logic, solving that problem, without having to worry about all of these intense details about what it takes to manage a database at scale. With the declarative syntax of SQL, you tell Vertica what the answer is that you want. You don't tell Vertica how to get it. Vertica will figure out the right way to do it for you so that you don't have to worry about it. So this SQL abstraction is very nice because it's a well defined boundary where lots of developers know SQL, and it allows you to express what you need without having to worry about those details. So we can be the experts in data management while you worry about your problems. This goes beyond, though, what's accessible through SQL to Vertica. We've got well defined extension and integration points across the product that allow you to customize this experience even further. So if you want to do things like write your own SQL functions, or extend the database software with UDXs, you can do so. If you have a custom data format that might be a proprietary format, or some source system that Vertica doesn't natively support, we have extension points that allow you to use those. To make it very easy to do massively parallel data movement, loading into Vertica but also exporting from Vertica to send data to other systems. And with these new features over time, we can also do the same kinds of things with Machine Learning models, importing and exporting to tools like TensorFlow.
And it's these integration points that have enabled Vertica to build out this open architecture and a rich ecosystem of tools, both open source and closed source, of different varieties that solve all different problems that are common in this big data processing world. Whether it's open source streaming systems like Kafka or Spark, or more traditional ETL tools on the loading side, but also BI tools and visualizers and things like that to view and use the data that you keep in your database on the right side. And then of course, Vertica needs to be flexible enough to be able to run anywhere. So you can really take Vertica and use it the way you want it to solve the problems that you need to solve. So Vertica has always employed open standards, and integrated with all kinds of different open source systems. What we're really excited to talk about now is that we are taking our new integration projects and making those open source too. In particular, we've got two new open source client libraries that allow you to build Vertica applications for Python and Go. These libraries act as a foundation for all kinds of interesting applications and tools. Upon those libraries, we've also built some integrations ourselves. And we're using these new libraries to power some new integrations with some third party products. Finally, we've got lots of new examples and reference implementations out on our GitHub page that can show you how to combine all these moving parts in exciting ways to solve new problems. And the code for all these things is available now on our GitHub page. And so you can use it however you like, and even help us make it better too. So the first such project that we have is called Vertica-Python. Vertica-Python began at our customer, Uber. And then in late 2018, we collaborated with them and we took it over and made Vertica-Python the first official open source client for Vertica. You can use this to build your own Python applications, or you can use it via tools that were written in Python. Python has grown a lot in recent years and it's a very common language to solve lots of different problems and use cases in the Big Data space, from things like DevOps automation and Data Science or Machine Learning, or just homegrown applications. We use Python a lot internally for our own QA testing and automation needs. And with the Python 2 End of Life that happened at the end of 2019, it was important that we had a robust Python solution to help migrate our internal stuff off of Python 2, and also to provide a nice migration path for all of you, our users, who might be worried about the same problems with your own Python code. So Vertica-Python is used already for lots of different tools, including Vertica's admintools, now starting with 9.3.1. It was also used by DataDog to build a Vertica-DataDog integration that allows you to monitor your Vertica infrastructure within DataDog. So here's a little example of how you might use the Python client to do some work. So here we open a connection, we run a query to find out what node we've connected to, and then we do a little data load by running a COPY statement. And this is designed to have a familiar look and feel if you've ever used a Python database client before. So we implement the DB API 2.0 standard and it feels like a Python package. So that includes things like being published to the central package index, so you can just pip install this right now and go start using it. We also have our client for the Go language.
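For readers following along, here is a minimal sketch of the flow Tom describes: open a connection, check which node you landed on, and load data with COPY. The host, credentials, and table names are placeholders; the full set of options is documented in the vertica-python README on GitHub.

```python
# pip install vertica-python
import vertica_python

# Placeholder connection details; see the vertica-python README for all options.
conn_info = {
    'host': 'vertica.example.com',
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'vmart',
}

with vertica_python.connect(**conn_info) as connection:
    cur = connection.cursor()

    # Run a query to find out what node we've connected to.
    cur.execute("SELECT node_name FROM v_monitor.current_session")
    print(cur.fetchone())

    # A little data load by running a COPY statement.
    with open('data.csv', 'rb') as fs:
        cur.copy("COPY my_table FROM STDIN DELIMITER ','", fs)
```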
So this is called vertica-sql-go. And this is a very similar story, just in a different context, for a different programming language. So vertica-sql-go began as a collaboration with the Micro Focus SecOps group, which builds Micro Focus's security products, some of which use Vertica internally to provide some of those analytics. So you can use this to build your own apps in the Go programming language, but you can also use it via tools that are written in Go. So most notably, we have our Grafana integration, which we'll talk a little bit more about later, that leverages this new client to provide Grafana visualizations for Vertica data. And Go is another programming language rising in popularity, because it offers an interesting balance of different programming design trade-offs. So it's got good performance, good concurrency and memory safety. And we liked all those things and we're using it to power some internal monitoring stuff of our own. And here's an example of the code you can write with this client. So this is Go code that does a similar thing. It opens a connection, it runs a little test query, and then it iterates over those rows, processing them using Go data types. You get that native look and feel just like you do in Python, except this time in the Go language. And you can go get it the way you usually package things with Go, by running that command there to acquire this package. And it's important to note here, for these projects, we're really doing open source development. We're not just putting code out on our GitHub page. So if you go out there and look, you can see that you can ask questions, you can report bugs, you can submit pull requests yourselves and you can collaborate directly with our engineering team and the other Vertica users out on our GitHub page. Because it's out on our GitHub page, it allows us to be a little bit faster with the way we ship and deliver functionality compared to the core Vertica release cycle. So in 2019, for example, as we were building features to prepare for the Python 3 migration, we shipped 11 different releases with 40 customer reported issues filed on GitHub. That was done over 78 different pull requests, and with lots of community engagement as we did so. So lots of people are using this already, as our GitHub badge shows, with about 5,000 downloads a day by people using it in their software. And again, we want to make this easy, not just to use but also to contribute to, understand, and collaborate with us on. So all these projects are built using the Apache 2.0 license. The master branch is always available and stable with the latest functionality. And you can always build it and test it the way we do, so that it's easy for you to understand how it works and to submit contributions or bug fixes or even features. It uses automated testing, both locally and with pull requests. And for vertica-python, it's fully automated with Travis CI. So we're really excited about doing this and we're really excited about where it can go in the future. 'Cause this offers some exciting opportunities for us to collaborate with you more directly than we ever have before. You can contribute improvements and help us guide the direction of these projects, but you can also work with each other to share knowledge and implementation details and various best practices. And so maybe you think, "Well, I don't use Python, I don't use Go, so maybe it doesn't matter to me." But I would argue it really does matter.
Because even if you don't use these tools and languages, there are lots of amazing Vertica developers out there who do. And these clients do act as low level building blocks for all kinds of different interesting tools, both in these Python and Go worlds, but also well beyond that. Because these implementations and examples really generalize to lots of different use cases. And we're going to do a deeper dive now into some of these to understand exactly how that's the case and what you can do with these things. So let's take a deeper look at some of the details of what it takes to build one of these open source client libraries. So these database client interfaces, what are they exactly? Well, we all know SQL, but if you look at what SQL specifies, it really only talks about how to manipulate the data within the database. So once you're connected and in, you can run commands with SQL. But these database client interfaces address the rest of those needs. So what does the programmer need to do to actually process those SQL queries? So these interfaces are specific to a particular language or a technology stack. But the use cases and the architectures and design patterns are largely the same between different languages. They all have a need to do some networking, and connect and authenticate and create a session. They all need to be able to run queries and load some data and deal with problems and errors. And then they also have a lot of metadata and type mapping, because you want to use these clients the way you use those programming languages, which might be different than the way that Vertica's data types and Vertica's semantics work. So some of these client interfaces are truly standards. And they are robust enough, in terms of what they define and call for, to support a truly pluggable driver model, where you might write an application that codes directly against the standard interface, and you can then plug in a different database driver, like a JDBC driver, to have that application work with any database that has a JDBC driver. So most of these interfaces aren't as robust as JDBC or ODBC, but that's okay. Because as good as a standard is, every database is unique for a reason. And so you can't really expose all of those unique properties of a database through these standard interfaces. So Vertica's unique in that it can scale to the petabytes and beyond. And you can run it anywhere in any environment, whether it's on-prem or in the cloud. So surely there's something about Vertica that's unique, and we want to be able to take advantage of that fact in our solutions. So even though these standards might not cover everything, there's often a need, and common patterns arise, to solve these problems in similar ways. When there isn't enough of a standard to define those common semantics that different databases might share, what you often see is tools will invent plugin layers or glue code to compensate, by defining an application-wide standard to cover some of these same semantics. Later on, we'll get into some of those details and show off what exactly that means. So if you connect to a Vertica database, what's actually happening under the covers? You have an application, you have a need to run some queries, so what does that actually look like? Well, probably as you would imagine, your application is going to invoke some API calls in some client library or tool.
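To illustrate that pluggable-driver idea concretely, here is a small sketch using Python's DB API 2.0: the application codes against the standard cursor interface and takes the driver module as a parameter, so any conforming driver can be plugged in. The connection details and table name are placeholders.

```python
def row_count(db_module, conn_kwargs, table):
    """Count rows using any DB API 2.0 driver module passed in."""
    conn = db_module.connect(**conn_kwargs)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM " + table)
        return cur.fetchone()[0]
    finally:
        conn.close()

# The same function works with vertica_python, psycopg2, or any other
# driver module that implements the DB API 2.0 standard.
import vertica_python
n = row_count(vertica_python,
              {'host': 'localhost', 'user': 'dbadmin', 'database': 'vmart'},
              'my_table')
```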
This library takes those API calls and implements them, usually by issuing some networking protocol operations, communicating over the network to ask Vertica to do the heavy lifting required for that particular API call. And so these APIs usually do the same kinds of things, although some of the details might differ between these different interfaces. But you do things like establish a connection, run a query, iterate over your rows, manage your transactions, that sort of thing. Here's an example from vertica-python, which just goes into some of the details of what actually happens during the Connect API call. And you can see all these details in our GitHub implementation of this. There are actually a lot of moving parts in what happens during a connection. So let's walk through some of that and see what actually goes on. I might have my API call like this, where I say Connect and I give it a DNS name, which is my entire cluster. And I give it my connection details, my username and password. And I tell the Python client to get me a session, give me a connection so I can start doing some work. Well, in order to implement this, what needs to happen? First, we need to do some TCP networking to establish our connection. So we need to understand what the request is, where you're going to connect to and why, by parsing the connection string. And Vertica being a distributed system, we want to provide high availability, so we might need to do some DNS look-ups to resolve that DNS name, which might be an entire cluster and not just a single machine, so that you don't have to change your connection string every time you add or remove nodes from the database. So we do some high availability and DNS lookup stuff. And then once we connect, we might do load balancing too, to balance the connections across the different initiator nodes in the cluster, or in a subcluster, as needed. Once we land on the node we want to be at, we might do some TLS to secure our connection. And Vertica supports the industry standard TLS protocols, so this looks pretty familiar for everyone who's used TLS anywhere before. So you're going to do a certificate exchange, and the client might send the server a certificate too, and then you're going to verify that the server is who it says it is, so that you can know that you trust it. Once you've established that connection and secured it, then you can start actually beginning to request a session within Vertica. So you're going to send over your user information like, "Here's my username, here's the database I want to connect to." You might send some information about your application, like a session label, so that you can differentiate on the database, with monitoring queries, what the different connections are and what their purpose is. And then you might also send over some session settings to do things like autocommit, to change the state of your session for the duration of this connection, so that you don't have to remember to do that with every query that you run. Once you've asked Vertica for a session, before Vertica will give you one, it has to authenticate you. And Vertica has lots of different authentication mechanisms. So there's a negotiation that happens there to decide how to authenticate you. Vertica decides based on who you are and where you're coming from on the network. And then you'll do an auth-specific exchange, depending on what the auth mechanism calls for, until you are authenticated.
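Many of the handshake steps Tom walks through surface as connection options in vertica-python. A hedged sketch follows: option names like connection_load_balance, session_label, and backup_server_node reflect the client's documented options, but you should verify them against the version you're using.

```python
import ssl
import vertica_python

conn_info = {
    'host': 'vertica-cluster.example.com',  # DNS name that may resolve to a whole cluster
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'vmart',
    # Spread connections across initiator nodes in the cluster.
    'connection_load_balance': True,
    # Fallback nodes for high availability if the first node is unreachable.
    'backup_server_node': ['node2.example.com', 'node3.example.com'],
    # Secure the connection with TLS.
    'ssl': ssl.create_default_context(),
    # Session label, so monitoring queries can tell connections apart.
    'session_label': 'reporting-app',
    # Session setting applied for the duration of this connection.
    'autocommit': True,
}

with vertica_python.connect(**conn_info) as connection:
    cur = connection.cursor()
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])  # client-side note keeping: record the server version
```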
Finally, Vertica trusts you and lets you in, so you're going to establish a session in Vertica, and you might do some note keeping on the client side just to know what happened. So you might log some information, you might record what the version of the database is, you might do some protocol feature negotiation. So if you connect to a version of the database that doesn't support all these protocols, you might decide to turn some functionality off, and that sort of thing. But finally, after all that, you can return from this API call and then your connection is good to go. So that connection is just one example of many different APIs. And we're excited here because with vertica-python we're really opening up the Vertica client wire protocol for the first time. And so if you're a low level Vertica developer and you might have used Postgres before, you might know that some of Vertica's client protocol is derived from Postgres. But they do differ in many significant ways. And this is the first time we've ever revealed those details about how it works and why. So not all Postgres protocol features work with Vertica, because Vertica doesn't support all the features that Postgres does. Postgres, for example, has a large object interface that allows you to stream very wide data values over. Whereas Vertica doesn't really have very wide data values; you have long varchars, but that's about as wide as you can get. Similarly, the Vertica protocol supports lots of features not present in Postgres. So load balancing, for example, which we just went through an example of: Postgres is a single node system, so it doesn't really make sense for Postgres to have load balancing. But load balancing is really important for Vertica because it is a distributed system. Vertica-python serves as an open reference implementation of this protocol, with all kinds of new details and extension points that we haven't revealed before. So if you look at these boxes below, all these different things are new protocol features that we've implemented since August 2019, out in the open on our GitHub page for Python. Now, the vertica-sql-go implementation of these things is still in progress, but the core protocols are there for basic query operations. There's more to do there but we'll get there soon. So this is really cool, 'cause not only do you now have a Python client implementation and a Go client implementation of this, but you can use this protocol reference to do lots of other things, too. The obvious thing you could do is build more clients for other languages. So if you have a need for a client in some other language that Vertica doesn't support yet, now you have everything available to solve that problem and to go about doing so if you need to. But beyond clients, it's also used for other things. So you might use it for mocking and testing things. So rather than connecting to a real Vertica database, you can simulate some of that. You can also use it to do things like query routing and proxies. So Uber, for example, the blog linked here tells a great story of how they route different queries to different Vertica clusters by intercepting these protocol messages, parsing the queries in them, and deciding which clusters to send them to. So a lot of these things are just ideas today, but now that you have the source code, there's no limit in sight to what you can do with this thing.
And so we're very interested in hearing your ideas and requests, and we're happy to offer advice and collaborate on building some of these things together. So let's take a look now at some of the things we've already built that do these things. So here's a picture of Vertica's Grafana connector, with some data powered from an example that we have in this blog link here. So this has an internet of things use case to it, where we have lots of different sensors recording flight data, feeding into Kafka, which then gets loaded into Vertica. And then finally, it gets visualized nicely here with Grafana. And Grafana's visualizations make it really easy to analyze the data with your eyes and see when something happens. So in these highlighted sections here, you notice a drop in some of the activity; that's probably a problem worth looking into. It might be a lot harder to see that just by staring at a large table yourself. So how does a picture like that get generated with a tool like Grafana? Well, Grafana specializes in visualizing time series data. And time can be really tricky for computers to do correctly. You've got time zones, daylight savings, leap seconds, negative infinity timestamps, please don't ever use those. And if it wasn't hard enough just with those problems, what makes it harder is that every system does it slightly differently. So if you're querying some time data, how do we deal with these semantic differences as we cross these domain boundaries, from Vertica to Grafana's back end architecture, which is implemented in Go, and its front end, which is implemented with JavaScript? Well, you read this from the bottom up in terms of the processing. First, you select the timestamp, and Vertica's timestamp has to be converted to a Go time object. And we have to reconcile the differences that there might be as we translate it. So Go time has a different time zone specifier format, and it also supports nanosecond precision, while Vertica only supports microsecond precision. So that's not too big of a deal when you're querying data, because you just see some extra zeros on the fractional seconds. But on the way in, if we're loading data, we have to find a way to resolve those things. Once it's in the Go process, it has to be converted further to render in the JavaScript UI. So there, the Go time object has to be converted to a JavaScript AngularJS Date object. And there too, we have to reconcile those differences. So a lot of these differences might just be presentation, and not so much the actual data changing, but you might want to choose to render the date into a more human readable format, like we've done in this example here. Here's another picture. This is another picture of some time series data, and this one shows you can actually write your own queries with Grafana to provide answers. So if you look closely here, you can see there are actually some functions that might not look too familiar to you if you know Vertica's functions. Vertica doesn't have a $__time function or a $__timeFilter function. So what's actually happening there? How does this actually provide an answer if it's not really real Vertica syntax? Well, it's not sufficient to just know how to manipulate data; it's also really important that you know how to operate with metadata. So, information about how the data works in the data source, Vertica in this case.
So Grafana needs to know how time works in detail for each data source, beyond doing that basic I/O that we just saw in the previous example. So it needs to know: how do you connect to the data source to get some time data? How do you know what time data types and functions there are and how they behave? How do you generate a query that references a time literal? And finally, once you've figured out how to do all that, how do you find the time in the database? How do you know which tables have time columns that might be worth rendering in this kind of UI? So Go's database standard doesn't actually really offer many metadata interfaces. Nevertheless, Grafana needs to know those answers. And so it has its own plugin layer that provides a standardizing layer, whereby every data source can implement hints and metadata customization needed to have an extensible data source back end. So we have another open source project, the Vertica-Grafana data source, which is a plugin that uses Grafana's extension points, with JavaScript in the front end plugins and with Go in the back end plugins, to provide Vertica connectivity inside Grafana. So the way this works is that the plugin framework defines those standardizing functions, like $__time and $__timeFilter, and it's our plugin that rewrites them in terms of Vertica syntax. So in this example, $__time gets rewritten to a Vertica cast, and $__timeFilter becomes a BETWEEN predicate. So that's one example of how you can use Grafana, but also how you might build any arbitrary visualization tool that works with data in Vertica. So let's now look at some other examples and reference architectures that we have out on our GitHub page. For some advanced integrations, there's clearly a need to go beyond these standards. So SQL and these surrounding standards, like JDBC and ODBC, were really critical in the early days of Vertica, because they really enabled a lot of generic database tools. And those will always continue to play a really important role, but the Big Data technology space moves a lot faster than these old database standards can keep up with. So there are all kinds of new advanced analytics and query pushdown logic, never possible 10 or 20 years ago, that Vertica can do natively. There are also all kinds of data-oriented application workflows doing things like streaming data, or parallel loading, or Machine Learning. And all of these things we need to build software with, but we don't really have standards to go by. So what do we do there? Well, open source implementations make for easier integrations and applications all over the place. So even if you're not using Grafana, for example, other tools have similar challenges that you need to overcome, and it helps to have an example there to show you how to do it. Take Machine Learning, for example. There have been many excellent Machine Learning tools that have arisen over the years to make data science and the task of Machine Learning a lot easier. And a lot of those have basic database connectivity, but they generally only treat the database as a source of data. So they do lots of data I/O to extract data from a database like Vertica for processing in some other engine. We all know that's not the most efficient way to do it. It's much better if you can leverage Vertica's scale and bring the processing to the data. So a lot of these tools don't take full advantage of Vertica, because there's not really a uniform way to do so with these standards.
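As a toy illustration of that rewriting step (not the plugin's actual code), a macro expander might look like the following in Python. The generated Vertica SQL is an assumption based on the cast and BETWEEN behavior described above.

```python
import re

def expand_grafana_macros(sql, time_from, time_to):
    # $__time(col) becomes a cast of the column, aliased as "time".
    sql = re.sub(r"\$__time\((\w+)\)", r'CAST(\1 AS TIMESTAMP) AS "time"', sql)
    # $__timeFilter(col) becomes a BETWEEN predicate over the dashboard's range.
    sql = re.sub(r"\$__timeFilter\((\w+)\)",
                 r"\1 BETWEEN '%s' AND '%s'" % (time_from, time_to), sql)
    return sql

query = "SELECT $__time(ts), AVG(value) FROM sensors WHERE $__timeFilter(ts) GROUP BY 1"
print(expand_grafana_macros(query, '2020-03-01 00:00:00', '2020-03-02 00:00:00'))
```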
So instead, we have a project called vertica-ml-python. And this serves as a reference architecture of how you can do scalable machine learning with Vertica. So this project establishes a familiar machine learning workflow that scales with Vertica. So it feels similar to, say, a scikit-learn project, except all the processing and aggregation and heavy lifting and data processing happens in Vertica. So this makes for a much more lightweight, scalable approach than you might otherwise be used to. So with vertica-ml-python, you can probably use this yourself as-is. But you can also see how it works. So if it doesn't meet all your needs, you can still see the code and customize it to build your own approach. We've also got lots of examples of our UDX framework. And so this is an older GitHub project. We've actually had this for a couple of years, but it is really useful and important, so I wanted to plug it here. With our User Defined eXtensions framework, or UDXs, this allows you to extend the operators that Vertica executes when it does a database load or a database query. So with UDXs, you can write your own domain logic in C++, Java, Python or R, and you can call it within the context of a SQL query. And Vertica brings your logic to that data, and makes it fast and scalable and fault tolerant and correct for you. So you don't have to worry about all those hard problems. So our UDX examples demonstrate how you can use our SDK to solve interesting problems. And some of these examples might be complete, totally usable packages or libraries. So for example, we have a curl source that allows you to extract data from any curlable endpoint and load it into Vertica. We've got things like an ODBC connector that allows you to access data in an external database via an ODBC driver within the context of a Vertica query, all kinds of parsers and string processors and things like that. We also have more exciting and interesting things where you might not really think of Vertica being able to do that, like a heat map generator, which takes some XY coordinates and renders them on top of an image to show you the hotspots in it. So the image on the right was actually generated from one of our intern gaming sessions a few years back. So all these things are great examples that show you not just how you can solve problems, but also how you can use this SDK to solve neat things that maybe no one else has to solve, or maybe that are unique to your business and your needs. Another exciting benefit is with testing. So the test automation strategy that we have in vertica-python and these clients really generalizes well beyond the needs of a database client. Anyone that's ever built a Vertica integration or an application probably has a need to write some integration tests. And that can be hard to do with all the moving parts in a big data solution. But with our code being open source, you can see in vertica-python, in particular, how we've structured our tests to facilitate smooth testing that's fast, deterministic and easy to use. So we've automated the download process and the installation and deployment process of a Vertica Community Edition. And with a single click, you can run through the tests locally and as part of the PR workflow via Travis CI. We also do this for multiple different Python environments. So for all Python versions from 2.7 up to 3.8, for different Python interpreters, and for different Linux distros, we're running through all of them very quickly with ease, thanks to all this automation.
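The "bring the processing to the data" point generalizes beyond vertica-ml-python: even with the plain client, you can push aggregation into Vertica instead of extracting rows for client-side processing. A minimal sketch, with placeholder table and column names:

```python
import vertica_python

conn_info = {'host': 'localhost', 'user': 'dbadmin', 'database': 'vmart'}

with vertica_python.connect(**conn_info) as connection:
    cur = connection.cursor()
    # Instead of SELECT * and computing statistics client-side (lots of
    # data I/O), let Vertica aggregate at scale and return three numbers.
    cur.execute("""
        SELECT AVG(feature_1), STDDEV(feature_1), CORR(feature_1, label)
        FROM training_data
    """)
    mean, stddev, corr = cur.fetchone()
    print(mean, stddev, corr)
```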
So today, you can see how we do it in vertica-python; in the future, we might want to spin that out into its own stand-alone testbed starter project, so that if you're starting any new Vertica integration, this might be a good starting point for you to get going quickly. So that brings us to some of the future work we want to do here in the open source space. Well, there's a lot of it. So in terms of the client stuff, for Python, we are marching towards our 1.0 release, which is when we aim to be protocol complete, supporting all of Vertica's unique protocols, including COPY LOCAL and some new protocols invented to support complex types, which is a new feature in Vertica 10. We have some cursor enhancements to do things like better streaming and improved performance. Beyond that, we want to take it where you want to bring it, so send us your requests. On the Go client front, it's just about a year behind Python in terms of its protocol implementation, but the basic operations are there. We still have more work to do to implement things like load balancing, some of the advanced auths and other things. But there too, we want to work with you, and we want to focus on what's important to you, so that we can continue to grow and be more useful and more powerful over time. Finally, there's this question of, "Well, what about beyond database clients? What else might we want to do with open source?" If you're building a very deep or a robust Vertica integration, you probably need to do a lot more exciting things than just run SQL queries and process the answers. Especially if you're an OEM or you're a vendor that resells Vertica packaged as a black box piece of a larger solution, you might have to manage the whole operational lifecycle of Vertica. There are even fewer standards for doing all these different things compared to the SQL clients. So we started with the SQL clients 'cause that's a well established pattern, and there's lots of downstream work that it can enable. But there's also clearly a need for lots of other open source protocols, architectures and examples to show you how to do these things and to have real standards. So we talked a little bit about how you could do UDXs or testing or Machine Learning, but there are all sorts of other use cases too. That's why we're excited to announce here our awesome-vertica page, which is a new collection of open source resources available on our GitHub page. So if you haven't heard of this awesome manifesto before, I highly recommend you check out the GitHub page on the right. We're not unique here; there are lots of awesome projects for all kinds of different tools and systems out there. And it's a great way to establish a community and share different resources, whether they're open source projects, blogs, examples, references, community resources, and all that. And this tool is an open source project itself. So it's an open source wiki, and you can contribute to it by submitting a PR yourself. So we've seeded it with some of our favorite tools and projects, but there's plenty more out there, and we hope to see it grow over time. So definitely check this out and help us make it better. So with that, I'm going to wrap up. I wanted to thank you all. Special thanks to Siting Ren and Roger Huebner, who are the project leads for the Python and Go clients respectively. And also, thanks to all the customers out there who've already been contributing stuff.
This has already been going on for a long time and we hope to keep it going and keep it growing with your help. So if you want to talk to us, you can find us at this email address here. But of course, you can also find us on the Vertica forums, or you could talk to us on GitHub too. And there you can find links to all the different projects I talked about today. And so with that, I think we're going to wrap up and now we're going to hand it off for some Q&A.

Published Date: Mar 30, 2020


Morgan McLean, Google Cloud Platform & Ben Sigelman, LightStep | KubeCon + CloudNativeCon EU 2019


 

>> Live from Barcelona, Spain, it's theCUBE, covering KubeCon, CloudNativeCon, Europe 2019. Brought to you by Red Hat, the Cloud Native Computing Foundation and Ecosystem Partners. >> Welcome back. This is theCUBE's coverage of KubeCon, CloudNativeCon 2019. I'm Stu Miniman, and my co-host for two days of wall-to-wall coverage is Corey Quinn. Happy to welcome back to the program first Ben Sigelman, who is the co-founder and CEO of LightStep. And welcome to the program, a first-time guest, Morgan McLean, who's a product manager at Google Cloud Platform. Gentlemen, thanks so much for joining us. >> Thanks for having us. >> Yeah. >> All right so, this was a last minute add for us, because you guys had some interesting news in the keynote. I think the feedback everybody's heard is there are too many projects and everything's overlapping, and how do I make a decision; but the interesting piece is OpenCensus, which Morgan was doing, and OpenTracing, which Ben and LightStep were doing, are now moving together as OpenTelemetry, if I got it right. >> Yup. >> So, is it just everybody's holding hands and singing Kumbaya around the Kubernetes campfire, or is there something more to this? >> Well I mean, it started when the CNCF locked us in a room and told us there were too many projects. (Stu and Ben laughing) Really wouldn't let us leave. No, to be fair, they did actually take us to a room and really start the ball rolling, but conversations have picked up over the last few months, and personally I'm just really excited that it's gone so well. Initially, if you told me six or nine months ago that this would happen, given just the way the projects were going, both growing very quickly, I would've been a little skeptical. But seriously, this merger's gone beyond my wildest dreams. It's awesome, both to unite the communities and to unite the projects together. >> What has the response been from the communities on this merger? >> Very positive. >> Yeah. >> Very positive. I mean OpenTracing and OpenCensus are both projects with healthy user bases that are growing quickly and all that, but the reason people adopt them is to future-proof their own software. Because they want to adopt something that's going to be here to stay. And by having these two things out in the world that are both successful, and were overlapping in terms of their goals, I think the presence of two projects was actually really problematic for people. So, the fact that they're merging is net positive, absolutely, for the end user community, and also for the vendor community; it's almost exactly the same parallel thought process. When we met, the CNCF did broker an in-person meeting where they gave us some space and we all got together and, I don't know how many people were there, like 20 or 30 people in that room. >> They did let us leave the room though, yesterday, yeah that was nice. >> They did let us leave the room, that's true. We were not locked in there, (Morgan laughing) but they asked us in the beginning, essentially they asked everyone to state what their goals were. And almost all of us really had the same goal, which is just to try and make it easy for end users to adopt a telemetry project that they can stick with for the long haul. And so when you think of it in that respect, the merger seems completely obvious. It is true that it doesn't happen very often, and we could speculate about why that is.
But I think in this case it was enabled by the fact that we had pretty good social relationships with the OpenCensus people. I think Twitter tends to amplify negativity in the world in general, as I'm sure people know, not a controversial statement. >> News alert, wait, absolutely, the negatives are, it's something in the algorithm I think. >> Yeah, yeah. >> Maybe they should fix that. >> Yeah, yeah (laughs) exactly. And it was funny, there was a lot of perceived animosity between OpenTracing and OpenCensus a year ago, nine months ago, but when you actually talk to the principals in the projects, and even just the general purpose developers who are doing a huge amount of work for both projects, that wasn't a sentiment that was widely held or widely felt, I think. So, it has been a very kind of happy, it's a huge relief frankly, this whole thing has been a huge relief for all of us I think. >> Yeah, it feels like the general ask has always been that: tracing that doesn't suck. And that tends to be a bit of a tall order. The way that they seem to have responded to it is a credit to the maturity of the community. And I think it also speaks to a growing realization that no one wants to have a monoculture of just one option, any color you want so long as it's black. (Ben laughing) Versus there's 500 different things you can pick that all stand in that same spot, and at that point analysis paralysis kicks in. So this feels like it's a net positive for absolutely everyone involved. >> Definitely. Yeah, one of the anecdotes that Ben and I have shared throughout a lot of these interviews is there were a lot of projects that wanted to include distributed tracing in them. So various web frameworks, I think, was it Hadoop or HBase was-- >> HBase and HDFS were jointly deciding what to do about instrumentation. >> Yeah, and so they would publish an issue on GitHub and someone from OpenTracing would respond saying, hey, OpenTracing does this. And they'd be like oh, that's interesting, we can go build an implementation, file an issue, and then someone from OpenCensus would respond and say, no wait, you should use OpenCensus. And with these being very similar yet incompatible APIs, these groups like HBase would sit there and be like, this isn't mature enough, I don't want to deal with this, I've got more important things to focus on right now. And rather than even picking one and ignoring the other, they just ignored tracing, right? With things moving to microservices, and with Kubernetes being so popular, I mean, just look at this conference. Distributed tracing is no longer this kind of nice-to-have; when you're a big company, you need it to understand how your app works and understand the cause of an outage, the cause of a problem. And when you had organizations like this that were looking at tracing instrumentation saying this is a bit of a joke with two competing projects, no one was being served well. >> All right, so you talked about there were incompatible APIs, so how do we get from where we were to where we're going? >> So I can talk about that a little bit. The APIs are conceptually incredibly similar. And part of the criteria for any new language for OpenTelemetry is that we are able to build a software bridge to both OpenTracing and OpenCensus that will translate existing instrumentation alongside OpenTelemetry instrumentation, and emit the correct data at the end. And we've built that out in Java already, and then started working on a few other languages.
It's not a tremendously difficult thing to do if that's your goal. I've worked on this stuff, I started working on Dapper in 2004, so it's been 15 years that I've been working in this space, and I have a lot of regrets about what we did with OpenTracing. And I had this unbelievably tempting thing to start greenfield, like, let's do it right this time, and I'm suppressing every last impulse to do that. And the only goal for this project, technically, is backwards compatibility. >> Yeah. >> 100% backwards compatibility. There's the famous XKCD comic where you have 14 standards and someone says, we need to create a new standard that will unify across all 14 standards, and now you have 15 standards. So, we don't want to follow that pattern. And by having the leadership from OpenTracing and OpenCensus involved wholesale in this new effort, as well as having these compatibility bridges, we can avoid the fate of IPv6, of Python 3 and things like that, where the new thing is very appealing but it's so far from the old thing that you literally can't get there incrementally. So our entire design constraint is: make sure that backwards compatibility works, get to one project, and then we can think about the grand unified theory of observability-- >> Ben, you are ruining the best thing about standards, which is that there are so many of them to choose from. (everyone laughing) >> There are still plenty more growing in other areas (laughs), just in this particular space it's smaller. >> One could argue that your approach is nonstandard in its own right. (Ben laughing) And in my own experiments with distributed tracing, it seems like step one is, first you have to go back and instrument everything you've built. And step two, hey, come back here, because that's a lot of work. The idea of an organization going back and reinstrumenting everything they've already instrumented the first time. >> It's unlikely. >> Unless they build things very modularly and very portably to do exactly that, it's a bit of a heavy lift. >> I agree, yeah, yeah. >> So going forward, are people who have deployed one or the other of your projects going to have to go back and do a reinstrumentation, or will they unify and continue to work as they are? >> So, I would posit, I don't know, I would be making up the statistic, so I shouldn't. But let's say a vast majority, I'm thinking like 95, 98%, of instrumentation is actually embedded in frameworks and libraries that people depend on. So you need to get Dropwizard, and Spring, and Django, and Flask, and Kafka, things like that need to be instrumented. The application code, the instrumentation, that burden is a bit lower. We announced something called SpecialAgent at LightStep last week, separate from all of this. It's kind of a funny combination; a typical APM agent will interpose on individual function calls, which is a very complicated and heavyweight thing. This doesn't do any of that, but it basically surveys what you have in your process, it looks for OpenTracing, and in the future OpenTelemetry, instrumentation that matches that, and then installs it for you. So you don't have to do any manual work, just basically gluing tab A into slot B or whatever; you don't have to do any of that stuff, which is what most OpenTracing instrumentation actually looks like these days. And you can get off the ground without doing any code modifications. So, I think that direction, which is totally portable and vendor neutral as well, as a layer on top of telemetry, makes a ton of sense.
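A conceptual sketch of the compatibility bridge Ben describes, in Python: existing OpenTracing instrumentation keeps calling the opentracing API, and an adapter forwards those calls to an OpenTelemetry tracer underneath. This is not the real shim's API, just an illustration of the pattern; the class names are hypothetical.

```python
import opentracing
from opentelemetry import trace

class BridgeSpan:
    """Hypothetical adapter: presents an OpenTracing-style span,
    backed by an OpenTelemetry span."""
    def __init__(self, otel_span):
        self._otel_span = otel_span

    def set_tag(self, key, value):
        # OpenTracing tags map onto OpenTelemetry attributes.
        self._otel_span.set_attribute(key, value)
        return self

    def finish(self):
        self._otel_span.end()

class BridgeTracer(opentracing.Tracer):
    """Hypothetical adapter tracer delegating to OpenTelemetry."""
    def __init__(self, otel_tracer):
        self._otel_tracer = otel_tracer

    def start_span(self, operation_name, **kwargs):
        return BridgeSpan(self._otel_tracer.start_span(operation_name))

# Old instrumentation written against opentracing now emits OpenTelemetry data.
opentracing.tracer = BridgeTracer(trace.get_tracer(__name__))
span = opentracing.tracer.start_span("checkout")
span.set_tag("customer.tier", "gold")
span.finish()
```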
There are also data translation efforts that are part of OpenCensus that are being ported into OpenTelemetry, and that also serve to repurpose existing sources of correlated data. So, all these things are ways to take existing software and get it into the new world without requiring any code changes or redeploys. >> The long-term goal of this has always been that, because web framework and client library providers will go and build the instrumentation into those, when you're writing your own service that you're deploying in Kubernetes or somewhere else, by linking one of the OpenTelemetry implementations you get all of that tracing and context propagation, everything, out of the box. You as a sort of individual developer are only using the APIs to define custom metrics, custom spans, things that are specific to your business. >> So Ben, you didn't name LightStep the same as your project. But that being said, a major piece of your business is going through a change here; what does this mean for LightStep? >> That's actually not the way I see it, for what it's worth. LightStep as a product, since you're giving me an opportunity to talk about it, (laughs) foolish move on your part. No, I'm just kidding. But LightStep as a product is totally omnivorous; we don't really care where the data comes from. And translating any source of data that has a correlation ID and a timestamp is a pretty trivial exercise for us. So we do support OpenTracing, and we also support OpenCensus for what it's worth. We'll support OpenTelemetry, and we support a bunch of weird in-house things people have already built. We don't care about that at all. The reason that we're pursuing OpenTelemetry is two-fold. One is that we do want to see high quality data coming out of projects. We said this at the keynote this morning, but observability literally cannot be better than your telemetry. If your telemetry sucks, your observability will also suck. It's just definitionally true, if you go back to the definition of observability from the '60s. And so we want high quality telemetry so our product can be awesome. Also, just as an individual, I'm a nerd about this stuff and I just like it. I mean a lot of my motivation for working on this is that I personally find it gratifying. It's not really a commercial thing, I just like it. >> Do you find that, as you start talking about this more and more with companies that are becoming cloud-native rapidly, either through digital transformation or springing fully formed from the forehead of some god, however these born-in-the-cloud companies tend to be, that they intuitively are starting to grasp the value of tracing? Or does this wind up being a much heavier lift as you start showing them the golden path, as it were? >> It's definitely grown, like I-- >> Well I think the value of tracing, you see that after you see the negative value of a really catastrophic outage. >> Yes. >> I mean I was just talking to a bank, I won't name the bank, but a bank at this conference, and they were talking about their own adoption of tracing, which was pretty slow, until they had a really bad outage where they couldn't transact for an hour and they didn't know which of the 200 services was responsible for the issue. And that really put some muscle behind their tracing initiative. So, typically it's inspired by an incident like that, and then it's a bit reactive. Sometimes it's not, but either way you end up in that place eventually.
>> I'm a strong proponent of distributed tracing and I feel very seen by your last answer. (Ben laughing) But it's definitely made a big impact. If you came to conferences like this two years ago, you'd have Adrian, or Yuri, or someone doing a talk on distributed tracing. And they would always start by asking the 100 to 200 person audience, who here knows what distributed tracing is? And like five people would raise their hand and everyone else would be like, no, that's why I'm here at the talk, I want to find out about it. And you go to ones now, or even last year, and now they have 400 people at the talk and you ask, who knows what distributed tracing is? And last year over half the people would raise their hand; now it's going to be even higher. And I think just beyond even anecdotes, clearly businesses are finding the value because they're implementing it. And you can see that through the number of companies that have an interest in OpenTracing, OpenTelemetry, OpenCensus. You can see that in the growth of startups in this space, LightStep and others. >> The other thing I like about OpenTelemetry as a name, it's a bit of a mouthful, but it's important for people to understand the distinction between telemetry and tracing data and actual solutions. I mean OpenTelemetry stops when the correct data is being emitted. And then what you do with that data is your own business. And I also think that people are realizing that tracing is more than just visualizing a single distributed trace. >> Yeah. >> The traces have an enormous amount of information in there about resource usage, security patterns, access patterns, large-scale performance patterns that are embedded in thousands of traces; that sort of data is making its way into products as well. And I really like that OpenTelemetry has clearly delineated that it stops with the telemetry. OpenTracing was confusing for people, where they'd want tracing and they'd adopt OpenTracing, and then be like, where's my UI? And it's like, well no, it's not that kind of project. With OpenTelemetry I think we've been very clear, this is about getting >> The name is more clear, yeah. >> very high quality data in a portable way with minimal effort. And then you can use that in any number of ways, and I like that distinction, I think it's important.
I meet with them, Ben meets with 'em, we all meet with 'em all the time, we work with them. And the biggest challenge we have is just the data we get is bad, right? Either we don't support certain platforms, we'll get traces that dead end at certain places, we don't get metrics with the same name for certain types of telemetry. And so this project is going to fix that and it's going to solve this problem for a lot of vendors who have this, frankly, a really strong economic incentive to play ball, and to contribute to it. >> Do you see that this, I guess merging of the two projects, is offering an opportunity to either of you to fix some, or revisit if not fix, some of the mistakes, as they were, of the past? I know every time I build something I look back and it was frankly terrible because that's the kind of developer I am. But are you seeing this, as someone who's probably, presumably much better at developing than I've ever been, as the opportunity to unwind some of the decisions you made earlier on, out of either ignorance or it didn't work out as well as you hoped? >> There are a couple of things about each project that we see an opportunity to correct here without doing any damage to the compatibility story. For OpenTracing it was just a bit too narrow. I mean I would talk a lot about how we want to describe the software, not the tracing system. But we kind of made a mistake in that we called it OpenTracing. Really people want, if a request comes in, they want to describe that request and then have it go to their tracing system, but also to their metric system, and to their logging stack, and to anywhere else, their security system. You should only have to instrument that once. So, OpenTracing was a bit too narrow. OpenCensus, we've talked about this a lot, built a really high quality reference implementation into the product, if OpenCensus, the product I mean. And that coupling created problems for vendors to adopt and it was a bit thick for some end users as well. So we are still keeping the reference implementation, but it's now cleanly decoupled. >> Yeah. >> So we have loose coupling, a la OpenTracing, but wider scope a la OpenCensus. And in that aspect, I think philosophically, this OpenTelemetry effort has taken the best of both worlds from these two projects that it started with. >> All right well, Ben and Morgan thank you so much for sharing. Best of luck and let us know if CNCF needs to pull you guys in a room a little bit more to help work through any of the issues. (Ben laughing) But thanks again for joining us. >> Thank you so much. >> Thanks for having us, it's been a pleasure. >> Yeah. >> All right for Corey Quinn, I'm Stu Miniman we'll be back to wrap up our day one of two days live coverage here from KubeCon, CloudNativeCon 2019, Barcelona, Spain. Thanks for watching theCUBE. (soft instrumental music)

Published Date: May 21, 2019
