Ben Miller, Recursion Pharmaceuticals | Splunk .conf 2017
>> Announcer: Live from Washington DC, it's theCube. Covering .conf2017. Brought to you by splunk.
>> Welcome back inside the Walter Washington Convention Center. We're at .conf2017 in Washington DC, the nation's capital, and it is alive and well and thriving. A little warm out there, almost 90 degrees, but a hot topic inside here, Dave.
>> There's a lot of heat in this city. (laughter)
>> A lot of hot air.
>> Yeah, absolutely.
>> We'll just leave it at that. Politics aside, of course. Joining us is Ben Miller, who is Director of High Throughput Screening at Recursion Pharmaceuticals. Ben, thanks for being with us here on theCube. We appreciate the time. First off, I have many questions. Let's talk about the company and what you do, then what high throughput screening means, and how that operation comes into play when you have this great nexus of biology and engineering that you've brought together.
>> Recursion Pharmaceuticals is treating drug discovery as a facial recognition problem. We're applying machine-learning concepts to biological images to help detect what types of drugs can rescue what types of diseases. We're one of the few companies that is both generating and analyzing our own data. As the director of the high throughput screening group, what I do is generate images for our data science teams to analyze. That means growing human cells in massive quantities, perturbing them with different types of disease reagents that cause their morphology to change, and then photographing them in the presence and in the absence of compounds, so we can see which compounds cause these disease states to revert toward a normal state for the cell.
>> Okay, HTS then ... Walk us through that if you would.
>> HTS is a general term used in the pharmaceutical industry to denote an assay that is executed at very large scale and in parallel. We tend to work in multiples of 384 experiments per plate. We're looking at hundreds of thousands of images per plate, and hundreds of plates per week. So when we say high throughput, we mean 6-10 terabytes of data per day.
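Those figures hang together arithmetically. The sketch below uses assumed values (sites per well, channels, camera resolution, plates per day) that are typical of high-content screening rigs but were not stated in the interview:

```python
# Back-of-envelope check on the throughput figures above. Everything
# tagged "assumed" is an illustrative guess typical of high-content
# screening, not a number Ben stated.
WELLS_PER_PLATE = 384      # stated: multiples of 384 experiments per plate
SITES_PER_WELL = 4         # assumed: camera fields imaged per well
CHANNELS = 6               # assumed: fluorescence channels per field
PLATES_PER_DAY = 60        # assumed: "hundreds of plates per week" over workdays
BYTES_PER_IMAGE = 2160 * 2160 * 2   # assumed: 16-bit camera, 2160x2160 pixels

images_per_plate = WELLS_PER_PLATE * SITES_PER_WELL * CHANNELS
tb_per_day = images_per_plate * BYTES_PER_IMAGE * PLATES_PER_DAY / 1e12
print(f"{images_per_plate} images/plate, ~{tb_per_day:.1f} TB/day")
# -> 9216 images/plate, ~5.2 TB/day: the right order of magnitude for
#    the stated 6-10 TB once metadata and processing overhead are added.
```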
>> Just extraordinary amounts of data. And the mission, as we understand it: you're looking at very rare genetic diseases, and your goal is to find cures for these over the next 15 to 20 years. Up to 100 of them. So that's why you're going through these multiple examinations of vast amounts of data. Human data.
>> Yeah, there's been a trend in the pharmaceutical industry in recent years where the number of dollars spent per drug developed keeps increasing. It now takes over one billion dollars to bring a drug to market, and every year it costs more. We believe we can change that by operating at a massively parallel scale and by analyzing image data at a truly deep level: looking at thousands of different features per image, instead of just a single feature.
>> That business is caught in this vicious cycle, and you guys are trying to break it.
>> Yes, exactly.
>> So what's the state of facial recognition been? I've had mixed reviews about it. I rave about it, I go, "Oh my God, Facebook tagged me again, it must be really good." And then others have told me, "Well, it's not really as reliable as you might think." What has your experience been?
>> The only experience I've had with facial recognition has been like yours, on Facebook and things like that. What we're doing is looking more at cellular recognition: being able to see differences in these cellular morphologies. I think there are some unique challenges when you're looking at images of thousands of cells, versus images of a single person's face.
>> Okay, so you've taken that concept down to the cell level, and it's highly accurate, presumably.
>> It's highly reproducible is what I would say, yeah.
>> So it takes some work to be accurate, and once you get it there you can reproduce it, is that right? How does the sequence work?
>> Yes, there are two parts to it. One is how consistently we can produce these images, and the other is how consistently those images represent the disease state. My focus is on making the images as consistent as they can be, while recognizing that the disease states are all unique. From our perspective, we're looking at thousands of different features in each image and figuring out how consistent those features are from image to image.
>> So paint a picture of your data stack, if you will. Infrastructure on up to the apps, and where splunk fits in.
>> Sure. You could say our data stack actually begins at hospitals around the world, where human cells are collected from various medical waste samples. We culture those up, perturb them with different reagents, add different potential drugs back to them, and then photograph them. So at the beginning of our stack we've got biological reagents that are mixed together, and then photographs are generated. Those photographs are .tif files, and we have thousands and thousands of them. They're all uploaded into Amazon Web Services' S3 system. We spin up a near-infinite number of virtual computers to process all of that image data within a couple of hours, and then produce a result: this drug makes this disease model look more like a healthy cell, and doesn't have other side effects. We're really reducing those thousands of dimensions in our image down to two: how much does it look like a healthy cell, and how much does it just look different than it should.
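That two-number readout can be pictured as a projection onto a disease-to-healthy axis. The sketch below is a minimal illustration of the idea, not Recursion's actual method (the interview doesn't specify one): the on-axis score reads as how "rescued" a well is, and the off-axis residual flags compounds that push cells somewhere neither healthy nor diseased.

```python
import numpy as np

def two_scores(treated, healthy_mean, disease_mean):
    """Collapse a high-dimensional morphology profile into two numbers:
    position along the disease->healthy axis (0 = disease-like,
    1 = healthy-like), and distance off that axis (side effects)."""
    axis = healthy_mean - disease_mean
    axis_len = np.linalg.norm(axis)
    unit = axis / axis_len
    offset = treated - disease_mean
    rescue = np.dot(offset, unit) / axis_len
    off_axis = np.linalg.norm(offset - np.dot(offset, unit) * unit)
    return rescue, off_axis

# Toy profiles: 2048 assumed image features per well.
rng = np.random.default_rng(0)
healthy = rng.normal(size=(100, 2048)).mean(axis=0) + 1.0
disease = rng.normal(size=(100, 2048)).mean(axis=0) - 1.0
treated = 0.8 * healthy + 0.2 * disease   # a strong partial rescue
print(two_scores(treated, healthy, disease))  # rescue ~0.8, off-axis ~0
```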
>> And where does splunk fit into that stack?
>> All of the instruments generating that data are equipped with splunk forwarders. So splunk is pulling all of our operational data from the laboratory together and marrying it up with the image analysis that comes from our proprietary data analysis system. By looking at the data we're generating, how many cells we're counting, how bright the intensity of the image is, and comparing that back to which dispenser we used, how long the plates sat at room temperature, et cetera, we can figure out how to optimize our production process so that we get reliable data.
>> So you're essentially storing machine data in the splunk data store. And then do you have an image database for ...?
>> Yeah. And the image database is incredibly large. I wouldn't even guess at the current size.
>> Dave: And what is it? Is it something on Amazon, an Amazon service?
>> Yeah. Right now all of our image data is stored on AWS.
>> This is one of those interviews, Dave, where the subject matter kind of trumps the technology, because I want to know how it works. But you need the technology, obviously, to drive it. So I'm trying to figure out: alright, you're taking human cells, taking snapshots in time, and looking at how they react to certain perturbations. But how does that picture of one person's cell reacting to a reagent compare to another person's? How does your data analysis provide you with insight when Dave's DNA is different from my DNA, and different from everybody's in this building? Ultimately, how are you combing through all of that data to make sense of it?
>> That's true, everybody has a unique genetic fingerprint, but everybody is susceptible to the same sets of major diseases. And really, that's the billion dollar question: how representative are these individual cellular images of the general human population, and will the effects we see at a cellular level translate into human populations? We're very close to clinical trials on several compounds, and that's when we will really find out how much proof there is in this concept.
>> Okay. You can't really predict ... Do you have a timeframe, or is it just sort of, "Keep going, keep getting funding until you reach the answer?" Is it survive until you thrive?
>> I personally don't maintain that kind of timeline. My role is in the laboratory, producing the data as quickly as we can. We do have a goal of treating 100 different diseases in the next 10 years. It's really early days, we're about 2 1/2 years into that goal. It seems like we're on track, but there's still a lot of work to be done between now and then.
>> So it's all cloud, right? And splunk is throughout that stack, as we talked about. How do you envision, or do you envision, using it differently? Are you trying to get more out of the splunk platform? What do you want to see from splunk?
>> That's a good question. I think right now we're using really the rudimentary, basic features of splunk. Their DB Connect app and their Machine Learning Toolkit are both pretty foundational to the work that we do. But right now a lot of our data models are one-time use. We do a particular analysis to find the root cause of a particular problem, we learn it, and that's the last time we use that model. Continuous implementation of data models is high on my list, as well as just ingesting more and more data. We're still fairly siloed: our temperature and humidity data is separate from our machine data, and bringing that all into splunk is on the list.
>> Why are your models disposable? It sounds like that's not done on purpose; is it some kind of infrastructure barrier?
>> We're really at the cutting edge of technology right now, and we're learning a lot of things that people haven't learned yet, things that in retrospect are obvious. To figure out the true cause of a particular situation, a data model or a machine-learning model is really valuable, but once you know that key salient fact, you don't need to keep tracking it over time. You don't need to keep relearning that when your tire pressure is low, your car gets fewer miles to the gallon.
>> David: You have the answer.
>> Right. But there are a lot of problems like that in our field that have not been discovered yet.
>> I inferred from your answer that you do see the potential for some kind of ongoing model evolution. For new use cases?
>> In the extreme case, we have a set of hundreds of operational parameters that go into producing an image of cells, and then thousands of cellular features extracted from that image. There's a machine-learning problem there: what are the optimal parameters to extract the optimal information? That whole process could be automated to the point where we're using machine-learning to optimize our assay. To me, that's the future of what we want to do.
>> Were you with Recursion when they brought in splunk?
>> Yeah.
>> You were. Did you look at alternatives? Did you look at maybe rolling your own with open source? Is that even feasible? I wonder if you could talk about that.
>> I had already been introduced to splunk at my previous job. At that previous company, before I heard of splunk, I was starting to roll my own: writing a ton of Perl scripts and all of these regular expressions, and searching network drives to pull log files together. And I thought that maybe there would be a good business model behind that.
>> You were building splunk. (laughter)
>> And then I found splunk, and those guys were so far ahead of what I was trying to do on my own in a lab. So for me it was a no-brainer. But our software engineering team is really dedicated to open source platforms whenever possible. They evaluated the ELK Stack, and some of us had used Sumo Logic and things like that. But for me, splunk had the right license model, and I could get off the ground really, really rapidly with it.
>> What about the license model was attractive to you?
>> Unlimited users, and only paying for the data that we ingest. The ability to democratize that data, so that everybody in the lab can go in and view it and I don't have to worry about how many accounts I'm creating. That was really powerful.
>> Dave: So you like the pricing model.
>> Yeah.
>> Some users have chirped about the pricing, and I saw some Wall Street concerns about it. But the guys we've talked to on theCube today have said they like the pricing model, that there's value there. And you're sort of confirming that.
>> Ben: Yeah.
>> You're not concerned about the exponential growth of your data causing your license fees to go through the roof?
>> In the laboratory, the image data we're generating is growing exponentially, but the operational parameter data is growing more linearly.
>> Dave: So it's under control, basically.
>> Yeah, for our needs it is.
>> Dave: You're not paying for the images; you're paying for the metadata around them.
>> Yeah.
>> Well, it's a fascinating proposition, it really is. We're very eager to keep up with this, keep track, and see the progress. Good luck with that, and we look forward to having you back on theCube to monitor that progress, alright, Ben?
>> Great. Very good, thank you so much.
>> Ben Miller, of Salt Lake City-based Recursion Pharmaceuticals, good to have you here. Back with more on theCube in just a bit. You're watching our live coverage of .conf2017. (upbeat innovative music)