Ben Amor, Palantir, and Sam Michael, NCATS | AWS PS Partner Awards 2021
>> Hello and welcome to theCUBE's coverage of the AWS (Amazon Web Services) Global Public Sector Partner Awards program. I'm John Furrier, your host of theCUBE. Here we're going to talk about the best COVID solution with two great guests: Ben Amor, healthcare and life sciences lead at Palantir. Ben, welcome to theCUBE. And Sam Michael, Director of Automation and Compound Management at NCATS, the National Center for Advancing Translational Sciences. NCATS is part of the NIH, the National Institutes of Health. Gentlemen, thank you for coming on, and congratulations on the best COVID solution.
>> Thank you so much, John.
>> So I've got to ask you: the best solution is, when can I get the vaccine? How fast, how long is it going to last? But I really appreciate you guys coming on.
>> I hope you're vaccinated. I would say, John, that's outside of our hands. I would say if you've not got vaccinated, go get vaccinated right now. Have someone stab you in the arm. Do not wait, go for it. That's not on us. But you've got that opportunity.
>> We have that done. I've got to get on a plane, and there are all kinds of hoops to jump through. We need a better solution anyway. You guys have great tech, so I want to dig in. All seriousness aside, you guys have put together a killer solution that really requires a lot of data. Let's step back and talk first about what the solution was that won the award. Can you take a quick second to set the table for what we're talking about? Ben, we'll start with you.
>> So the National COVID Cohort Collaborative is a secure data enclave putting together the EHR records from more than 60 different academic medical centers across the country and making them available to researchers to ask many and varied questions to try and understand this disease better.
>> Sam, take us through the challenges here. What was going on? What was the hard problem?
Obviously everyone had a situation with COVID where people broke through, and cloud has driven it; Amazon is part of the awards, but you guys are solving something. What was the problem statement that you guys were going after? What happened?
>> I think the problem statement is essentially that the nation has electronic health records, but it's very fragmented, right? As Ben highlighted, there are multiple systems around the country, thousands of folks that have EHRs, but there is no way from a research perspective to actually have access in any unified location. And so really what we were looking for is how we can provide a centralized location to study electronic health records, but in a federated sense, because we recognize that the data exists in other locations. So we had to figure out, for a vast quantity of data, how we can get data from those 60 sites, the 60-plus that Ben is referencing, from their respective locations into one central repository, but also in a common format. Because that's another huge aspect of the technical challenge: there are multiple formats for electronic health records, there are different standards, there are different versions. And how do you actually have all of this data harmonized into something which is usable, again, for research?
>> There are so many things jumping into my head right now; I want to unpack them one at a time. COVID hit, and the scramble and the imperative for getting answers quickly was huge. So it's a data problem at a massive scale with public health impact. Again, we were talking before we came on camera: public health records are dirty, they're not clean. A lot of things are weird. I mean, just a massive amount of weird problems. How did you guys pull it together? Take me through how this gets done. What happened? Take us through the steps. You just got together and said, let's do this? How does it all happen?
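As an illustration of the harmonization problem Sam describes above, the sketch below maps rows from two made-up site formats into one common record shape. Everything in it is an assumption for illustration only: the `CommonLabRecord` fields, the format labels "A" and "B", and the key names are hypothetical, and this is not the actual N3C schema or pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class CommonLabRecord:
    """Hypothetical common-format row (not the real N3C or OMOP schema)."""
    site_id: str
    patient_id: str
    test_code: str
    value: float
    unit: Optional[str]  # may be missing in the source data


def from_format_a(site_id: str, row: dict) -> CommonLabRecord:
    # Hypothetical format A: flat keys, unit always present.
    return CommonLabRecord(site_id, str(row["pid"]), row["code"],
                           float(row["val"]), row["unit"])


def from_format_b(site_id: str, row: dict) -> CommonLabRecord:
    # Hypothetical format B: nested result object, unit may be absent.
    result = row["result"]
    return CommonLabRecord(site_id, str(row["patient"]), row["test"],
                           float(result["value"]), result.get("unit"))


# One mapper per source format; a new site format means a new entry here.
MAPPERS: Dict[str, Callable[[str, dict], CommonLabRecord]] = {
    "A": from_format_a,
    "B": from_format_b,
}


def harmonize(site_id: str, source_format: str, rows: List[dict]) -> List[CommonLabRecord]:
    """Convert one site's rows into the common record shape."""
    mapper = MAPPERS[source_format]
    return [mapper(site_id, row) for row in rows]
```

The design point is that each site keeps its own format and the central side owns the per-format mappers, which matches the "give us the data how you have it" approach described later in the conversation.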
>> Yeah, it's a great question. So John, I would say part of this actually started several years ago. I explain this when people talk about N3C: NCATS supports a program called the Clinical and Translational Science Awards program. It's the largest single grant program in all of NIH, and it constitutes the bulk of the NCATS budget. So these are extramural grants which go all over the country. And we wanted this group to essentially have a common research environment. So we tried to create what we call secure scientific collaborative platforms. Another example of this is what we call the Rare Diseases Clinical Research Network, which again is a consortium of 20 different sites around the nation. And so really we started working on this several years ago: if we want to build an environment that's collaborative for researchers around the country and around the world, the natural place to do that is really with a cloud-first strategy. And we recognized this because at NCATS we're about 600 people now, but if you look at the size of our actual research community with our grantees, we're in the thousands. And so the perspective that we took several years ago was that we have to really take a step back. If we want to have a comprehensive and cohesive package or solution, we have to treat this as really a mid-sized business, and that means we have to treat this as a cloud-based enterprise. And so NCATS several years ago had really gone on this strategy to bring in different commercial partners, of which one is Palantir. It actually started with our intramural research program, and obviously with very heavy cloud use with AWS. We use Google Workspace; essentially we use different cloud tools to enable our collaborative researchers. The next step is we also had a project: if we want to have an environment, we have to have access.
And this is something that we took early steps on years prior: there is no good building an environment if people can't get in the front door. So we invested heavily and created an application which we call our federated authentication system. We call it Unified NCATS Auth, so we call it UNA for short. This is an open source, in-house project that we built at NCATS. And we wanted to actually use this for all sorts of implementations, acting as the front door to this collaborative environment being one of them. And then there was also this interest in electronic health records that had existed prior to the COVID pandemic. We'd done some prior work via a mixture of internal investments and grants with collaborative partners to really look at what it would take to harmonize this data at scale. And so, like you mentioned, COVID hit, and it hit really hard. Everyone was scrambling for answers. And I think we had a bit of these pieces in play. And that's, I think, when we turned to Ben and the team at Palantir and said: we have these components, we have these pieces; what we really need is something independent that we can stand up quickly to really address some of these problems, one of the biggest being that data ingestion and harmonization step. And so I can let Ben really speak to that one.
>> Yeah, Ben, elaborate, because you're solving a lot of collaboration problems, not just the technical problem, but ingestion and harmonization. Ingestion most people can understand; it's data warehousing, getting it into the database, and they know what that means. Take us through harmonization, because, not to put a little bit of shade on this, but most people think about these kinds of research organizations or nonprofits as slow moving: standing stuff up takes time, and by the time you're done, things are over. This was agile. So take us through what made it agile, because that's not normal.
I mean, that's not what you see normally. It's like, hey, we'll see you next year, when we stand that up at the data center.
>> Yeah, I mean, as Sam described, the question of data interoperability is a really essential problem for working with this kind of data. We have data coming from more than 60 different sites, and one of the reasons we were able to move quickly was because, rather than saying you have to provide the data in a certain format, a certain standard, N3C was able to say: actually, just give us the data how you have it, in whatever format is easiest for you, and we will take care of the process of transforming it into a single standard data model, converting all of the medical vocabularies, and doing all of the data quality assessment that's needed to ensure that data is actually ready for research. And that was very much a collaborative endeavor. It was run out of a team based at Johns Hopkins University, but in collaboration with a broad range of researchers who were all adding their expertise. What we were able to do was provide the technical infrastructure for taking the transformation pipelines that were being developed, the actual logic and the code, and developing these very robust, centralized templates for them that could be deployed just like software is deployed, with change management, upgrades and downgrades, version control, and change logs, so that we can roll that out across a large number of sites in a very robust way very quickly. So that's one aspect of it. And then there was a bunch of really interesting challenges along the way that, again, a very broad collaborative team of researchers worked on. An example of that would be unit harmonization and inference. So really simple things: when a lab result arrives, and we talked about data quality, you would expect it to have a unit, right?
Like, if you're reporting somebody's weight, you probably want to know if it's in kilograms or pounds. But we found that a very significant proportion of the time the unit was actually missing in the EHR record. And unless you can actually get that back, the record becomes useless. So an approach was developed: because we had data across 60 or more different sites, you have a large number of lab tests that do have the correct units, and you can look at the data distributions and decide how likely it is that this missing unit is actually kilograms or pounds, and save a huge portion of these labs. So that's just an example of something that has enabled research to happen that would not otherwise have been possible.
>> Just not to dig in and rat-hole on that one point, but what time saving do you think that gives? I mean, I can imagine on the data cleaning side that's a massive time savings, just to infer: okay, based on the data sampling, this is kilograms or pounds.
>> Exactly. We're talking about more than 3.5 billion lab records in this database now. So if you were trying to do this manually, it would take thousands of years. It just wouldn't be feasible.
>> It would be a black hole in the dataset, essentially, because there's no way it would get done. Okay. Sam, take me through this from a research standpoint: this normalization, harmonization process. What does that enable for the research, and who decides what the standard format is? Because again, I'm just thinking in my mind how hard this is. And then what was decided? Was it just based on the records, what standards were there? What's the impact for researchers now?
>> It's a great question. Well, a couple of things I'll say, and Ben has touched on this: the other real core piece of N3C is the community, right?
And so I think there are a couple of things you mentioned with this, John. The way we executed this, it was very nimble, it was very agile, and there's something to be said on that piece from a procurement perspective. The government had many COVID authorities that were granted to make very fast decisions to get things procured quickly, and we were able to turn this around with our acquisition shop. Otherwise we would be dead in the water, like you said, waiting a year to go through a normal acquisition process, which can take time. But that's only one half. The other half, and you're touching on this and Ben is touching on this when he mentions the researchers, is that we have this entire core, this entire research community numbering in the thousands, from a volunteer perspective. I think it's really fascinating. This is a really great example to me of a public-private partnership between the companies we use and also the academic participants that actually make up the community. The amount of time they have dedicated to this is just incredible. So really, what's also been established with this is co-governance. From a systems perspective, the Palantir environment, the N3C environment, belongs to the government, but N3C, the entire program, I would say, belongs to the community. We have co-governance on this. So who decides really is a mixture: folks at NCATS, but not just folks at NCATS; folks at NIH proper, folks at other government agencies, and also the academic communities. There are entire mixed governance teams that actually set the stage for all of this. And again, who's going to decide the standard? We decided we're going to do this in OMOP 5.3.1; that's the standard we're going to utilize.
And then once the data is there, this is what gets exciting: they have the different domain teams, where they can ask different research questions depending upon what is of interest scientifically to them. And so really, we viewed this from the government's perspective as: how do we build, again, the secure platform where we can enable the research, but we don't really want to dictate the research. I mean, the one criterion we did put in is that the research has to be COVID-focused, because this is very clearly in response to COVID. So you have to have a COVID focus, and then we have data use agreements and data use requests. We have entire governance committees that decide whether the research is in scope, but we don't want to dictate the research types that the domain teams are bringing to the table.
>> And I think of the National Institutes of Health; their mission is to serve public health. I think this is a great example of how, when you enable data to be surfaced and available, you can really allow people to be empowered. Not to use the cliche "citizen analysts," but in a way this is what the community is doing. You're doing research and allowing people, from volunteers to academics to students, to just be part of it. That is citizen analysis. You've got citizen journalism, you've got citizen research, you've got a lot of democratization happening here. Was that part of it, or was it a result of this?
>> It's a great question. I think it's both. And it's really by design, because again, we want to enable. There are a couple of things that we really clamor for at NCATS, and I think NIH is going in this direction too: we believe firmly in open science, we believe firmly in open standards, and in how we can actually enable these standards to promote this open science, because it's actually nontrivial.
We've had the citizen scientists, which is actually a tricky problem from a governance perspective, or we've had the case where students wanted access to the environment. Well, we actually had to have someone, because they have to have an institution that they come in with. But we've actually crossed some of those bridges to get students and researchers into this environment, very much by design, but also in the spirit which was enabled by the community. So I think they go hand in hand.
>> Open science is a huge wave. I'm a big fan. I think that's got a lot of headroom, because look at open source and what that's done to the software industry; it's amazing. And I think your federated idea comes in here. Ben, if you guys can just talk through the federated piece, because I think that might enable and remove some of the structural blockers that might be out there, in terms of: oh, you've got to be affiliated with this or that, or friends have got to invite you. But then you've got privacy, access, and this federated ID. It's not an easy thing; it's easy to say. But how do you tie that together? Because you want to enable a frictionless ability to come in and contribute, and at the same time you want to have some policies around who's in and who's not.
>> Yes, totally. I mean, Sam already described the UNA system, which is the authentication system that NCATS has developed. And from our perspective, we integrate with that using all of the standard authentication protocols, and it's very easy to integrate that into the Foundry platform and make it so that we can authenticate people correctly. But if you go beyond authentication, you also need to have the access controls in place to say: yes, I know who this person is, but now what should they actually be able to see?
And I think one of the really great things N3C has done is to be very rigorous about that. They have their governance rules that say you should be using the data for a certain purpose. You must go through a procedure so that the access committee approves that purpose. And then we need to make sure that you're actually doing the work that you said you were going to. So before you can get your data or your results back out of the system, you actually have to prove that those results are in line with the original stated purpose. The infrastructure around that, having the access controls and the governance processes all working together in a seamless way so that it doesn't, as you say, increase the friction on the researcher, and so they can get access to the data for that appropriate purpose: that was a big component of what we've been building out with N3C.
>> Absolutely. And it's really in line, John, with what NIH is doing with the Researcher Auth Service; they call this RAS. I think these are things that we believe in, and standards that we're starting to follow, and we work with them closely. Multifactor authentication matters because of the point Ben is making, and that you raised as well. One, you need to authenticate: okay, you are who you say you are, and we're recognizing that. That's the authN piece. Then there's the authZ: what are you authorized to see? What do you have authorization to? And they go hand in hand, and again, these are nontrivial problems. Typically a lot of what we're doing is direct integrations with our package. We're using InCommon for federated access; we're also even using Login.gov, again because we need to make sure that people have a means. And Login.gov is essentially a one-off, right?
If they don't have an organization which we have in InCommon or federated access, they can generate a Login.gov account, but they are still beholden to the multi-factor authentication step, and then they still have to get the same authorizations. Because we really do believe that seamless access to these environments is absolutely critical. We need to know who our users are, but again not make it restrictive and not make it this friction-filled process. That's very different.
>> I mean, you say nontrivial; I totally agree with you. If you think about a classic enterprise, I think about an IT problem like bring-your-own-device to work, and that's basically what the whole world does these days. So you're thinking about access: you don't know who's coming in, you don't know where they're coming in from, and the churn is so high. I mean, all this is happening, right? So you have to be prepared to provision and provide resources to a very lightweight access edge.
>> That's right. And that's why it gets back to what we mentioned: we were taking a step back and thinking about this problem, and N3C became the use case. This is an enterprise IT problem, right? We have users from around the world that want to access this environment, and again we try to hit a really difficult mark, which is secure but collaborative. That's not easy. But again, the only place this environment could take place is in a cloud-based environment, right? Let's be real. Ten years ago, forget it. Again, maybe it would have been difficult, but now it's just incredible how much it has advanced, so that these real virtual research organizations can start to exist and become real partnerships.
>> Well, that's a great point.
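The distinction drawn above between authentication (who you are, including the multi-factor step) and authorization (what an approved purpose lets you see) can be sketched in a few lines. This is a minimal illustration under stated assumptions: `User`, `AccessPolicy`, and the purpose strings are hypothetical names, and none of this reflects the actual UNA, RAS, InCommon, or Login.gov interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class User:
    username: str
    passed_mfa: bool  # outcome of the identity provider's multi-factor step


@dataclass
class AccessPolicy:
    """Hypothetical record of purposes an access committee has approved."""
    approved: Dict[str, Set[str]] = field(default_factory=dict)

    def approve(self, username: str, purpose: str) -> None:
        # Called after a (hypothetical) committee approves a stated purpose.
        self.approved.setdefault(username, set()).add(purpose)

    def is_authorized(self, user: User, purpose: str) -> bool:
        # Authentication gate: identity must be verified, MFA included.
        if not user.passed_mfa:
            return False
        # Authorization gate: only purposes approved for this user pass.
        return purpose in self.approved.get(user.username, set())
```

In this sketch the login route does not change the outcome: whichever identity provider vouches for the user, the same MFA requirement and the same purpose check apply, which is the property described in the conversation.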
I want to highlight that and call it out, because I've done a lot of these interviews with awards programs over the years, certainly in public sector and open source, over many, many years. One of the things open source allows is code reuse. And then you start getting into these situations where, okay, you have a crisis, COVID, and other things happen. With nonprofits it's the same thing: they lose their funding and all the code disappears. Same with COVID: when it's over, you don't want to lose the momentum. So this whole idea of reuse, this idea of replatforming and refactoring, if you will: these are two concepts that cloud enables. Sam, I'd love to get your thoughts on this, because it doesn't go away when COVID's over; research still continues. So this whole idea of replatforming and then refactoring is very much a new concept versus the old days of, okay, project's over, move on to the next one.
>> No, you're absolutely right. And I think what first drove us is that we were taking a step back at NCATS and asking how we ensure that sustainability, right? My background is actually engineering, so I think about how you want to build things to last. And what you just described, John, is that funding peaks, it goes up, and then it wanes away, and what you're left with essentially is nothing. You did this investment in a body of work and it goes away. And really, I think what we're building are these sustainable platforms that will actually grow and evolve based upon the research needs over time. I think that was really a huge investment that NCATS has made, and NIH is going in a very similar direction. There's a substantial investment made in these really impressive environments. How do we make sure they're sustainable for the long term?
Again, we just went through this with COVID, but what's going to come next? What are the research questions that we need to answer? But also, open source is an incredibly important piece of this. I think Ben can speak to this in a second: all the harmonization work, all that effort, essentially this massive, complex ETL process, is in the N3C GitHub. So we believe completely in the open source model, with a little bit of a flavor on it too, though, because, again back to the sustainability, John, I believe there's room for this marriage between commercial platforms and open source software, and we need both. At NCATS we're strong proponents of both, especially with sustainability. Especially, I think, with enterprise IT, you have to have professional-grade products. That was part of, I would say, an experiment we ran at NCATS. Our thought was: we can fund academic groups and we can have them do open source projects, and you'll get some decent results. But I think by the nature of it, these environments become so complex. The experiment we're running is that we're going to provide commercial-grade tools for the academic community and the researchers, let them use them, and see how they can be enabled to actually focus on research questions. And I think with N3C we've been very successful with that model, while still really adhering to the open source spirit and principles.
>> It's an amazing story. Congratulations. You know what? That's so awesome, because that's the future, and I think you're onto something huge. Great point. Ben, do you want to chime in on this whole sustainability piece? Because the public-private partnership idea is now the new model; the innovation formula is about being open and collaborative. What are your thoughts?
>> Absolutely. I mean, we at Palantir have been huge proponents of reproducibility and openness in analyses and in science.
And so everything done within the Foundry platform is done in open source languages like Python, R, and SQL, and is exposed via open APIs and through Git repositories. So, as Sam says, we've pushed all of that ETL code that was developed within the platform out to the NCATS GitHub. And the analysis code itself, being written in those various languages, can also easily be pulled out and made available for other researchers in the future. I think what we've also seen is that within the data enclave there's been an enormous amount of reuse across the different research projects. Actually having that security in place, and making it secure so that people can start to share with each other securely as well, and being very clear that although I'm sharing this, it's still within the range of the government's requirements, has meant that the research has really been accelerated, because people have been able to build and stand on the shoulders of what earlier projects have done.
>> Okay, Ben, great stuff. A thousand researchers, open source code on GitHub. Where do I sign up? I want to get involved. This is amazing. It sounds like a great party.
>> We'll send you a link. If you do a search on N3C, you'll come up with a website hosted by the academic side, and it'll show you all the information on how you can actually connect. And John, you're welcome to come in, by all means.
>> Billions of rows of data being solved. Great tech you're working on. Again, this is a great example of large scale; the modern era of solving problems is here. It's out in the open: open science. Sam, congratulations on your great success. Ben, award winners. You guys are doing a great job. Great story. Thanks for sharing here with us on theCUBE. Appreciate it.
>> Thank you, John.
>> Thanks for having us.
>> Okay. This is the
Global Public Sector Partner Awards: best COVID solution, Palantir and NCATS. Great solution. Great story. I'm John Furrier with theCUBE. Thanks for watching.
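Ben's example of unit inference (deciding whether a weight with a missing unit is more likely kilograms or pounds by looking at the distributions of records that do carry units) can be illustrated with a small sketch. This is an assumption-laden toy, not the method N3C actually used: it fits a normal distribution per unit, compares log-likelihoods, and abstains when the evidence is weak. The function names and the `min_margin` threshold are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Optional, Tuple


def fit_unit_distributions(known: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Fit (mean, standard deviation) per unit from records that have a unit."""
    fitted = {}
    for unit, values in known.items():
        if len(values) < 2:
            continue  # not enough data to estimate spread
        sigma = stdev(values)
        if sigma > 0:
            fitted[unit] = (mean(values), sigma)
    return fitted


def log_likelihood(value: float, mu: float, sigma: float) -> float:
    """Log density of value under a normal distribution N(mu, sigma^2)."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (value - mu) ** 2 / (2 * sigma ** 2)


def infer_unit(value: float,
               distributions: Dict[str, Tuple[float, float]],
               min_margin: float = 2.0) -> Optional[str]:
    """Return the most likely unit, or None if no unit wins clearly."""
    scored = sorted(
        ((log_likelihood(value, mu, sigma), unit)
         for unit, (mu, sigma) in distributions.items()),
        reverse=True,
    )
    if not scored:
        return None
    if len(scored) == 1:
        return scored[0][1]
    (best, unit), (runner_up, _) = scored[0], scored[1]
    # Abstain when the best unit is not clearly better than the runner-up.
    return unit if best - runner_up >= min_margin else None
```

With adult weights fitted from a handful of labelled records in each unit, a value such as 72 resolves to kilograms and 180 to pounds, while a value that sits between the two distributions is left unresolved rather than guessed.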