A Day in the Life of a Data Scientist

>>Hello, everyone. Welcome to the "A Day in the Life of a Data Scientist" talk. My name is Terry Chang. I'm a data scientist on the HPE Ezmeral Container Platform team. With me in the chat room, moderating the chat, I have Matt Maccaux as well as Doug Tackett, and we're going to dive straight into what we can do with the HPE Ezmeral Container Platform and how we can support the role of a data scientist.
>>So, just a quick agenda. I'm going to do some introductions and set the context of what we're going to talk about. Then we're going to dive straight into the HPE Ezmeral Container Platform and walk through what a data scientist will do, pretty much a day in the life of a data scientist. And then we'll have some questions and answers. So big data has been the talk of the last few years, the last decade or so. With big data, there are a lot of ways to derive meaning, and a lot of businesses are trying to optimize every decision in their applications using data. Previously we had a lot of focus on data analytics, but recently we've seen a lot of data being used for machine learning: taking any data they can and sending it off to the data scientists to start doing some modeling and prediction.
>>So that's where we're seeing modern businesses rooted in analytics, and data science in itself is a team sport. We need more than data scientists to do all this modeling. We need data engineers to take the data, massage it, and do some data manipulation in order to get it right for the data scientists. We have data analysts who are monitoring the models, and we have the data scientists themselves, who are building and iterating through multiple different models until they find one that is satisfactory to the business needs.
Then once they're done, they can send it off to the software engineers, who will actually build it into their application, whether it's a mobile app or a web app. And then we have the operations team assigning the resources and monitoring it as well.
>>So we're really seeing data science as a team sport, and it does require a lot of different expertise. Here's the basic machine learning pipeline that we see in the industry now. At the top we have this training environment, and this is an entire loop. We'll have some registration, we'll have some inferencing, and at the center of all this is the data prep, as well as your repositories, such as for your data, for any of your GitHub repositories, things of that sort. So we're seeing the machine learning industry follow this very basic pattern. At a high level, and I'll glance through this very quickly, this is what the machine learning pipeline will look like on the HPE Ezmeral Container Platform. At the top left, we'll have our project repository, which is our persistent storage.
>>We'll have some training clusters, a notebook, an inference deployment engine, and a REST API, all sitting on top of the Kubernetes cluster. The benefit of the container platform is that this is all abstracted away from the data scientist, so I will go straight into that. Just to preface, before we go into the HPE Ezmeral Container Platform: what we're going to look at is a machine learning example problem that is trying to predict how long a specific taxi ride will take. With a Jupyter notebook, the data scientists can take all of this data, do their data manipulation, train a model on a specific set of features, such as the location of a taxi ride and the duration of a taxi ride, and then model it to figure out what kind of prediction we can get on a future taxi ride.
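As a rough illustration of the kind of features mentioned here, the sketch below turns one raw trip record into model inputs plus the prediction target (ride duration in seconds). The field names, the coordinates, and the `to_training_row` helper are hypothetical and are not taken from the demo notebook.

```python
from datetime import datetime


def to_training_row(trip: dict) -> dict:
    """Turn one raw trip record into model features plus the target."""
    pickup = datetime.fromisoformat(trip["pickup_time"])
    dropoff = datetime.fromisoformat(trip["dropoff_time"])
    return {
        # Location features, passed through unchanged.
        "pickup_lat": trip["pickup_lat"],
        "pickup_lon": trip["pickup_lon"],
        # Time-of-day features derived from the pickup timestamp.
        "hour_of_day": pickup.hour,
        "day_of_week": pickup.weekday(),  # Monday == 0
        # Target the model learns to predict.
        "duration_s": (dropoff - pickup).total_seconds(),
    }


# Made-up example record, for illustration only.
example_trip = {
    "pickup_time": "2021-03-17T08:15:00",
    "dropoff_time": "2021-03-17T08:40:30",
    "pickup_lat": 40.75,
    "pickup_lon": -73.99,
}

row = to_training_row(example_trip)
print(row["hour_of_day"], row["day_of_week"], row["duration_s"])
```

In a real notebook this step would more likely be a vectorized dataframe operation; the per-record function just keeps the idea visible.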
>>So that's the example that we will talk through today. I'm going to hop out of my slides and jump into my web browser. Let me zoom in on this. Here I have a Jupyter environment, and this is all running on the container platform. All I need is this link and I can access my environment. So as a data scientist, I can grab this link from my IT admin or my system administrator, and I can quickly start iterating and coding. On the left-hand side of Jupyter, we have a file directory structure. This is already synced up to my Git repository, which I will show in a little bit on the container platform, so I can quickly pull any files that are in my GitHub repository. I can even push with a button here, and I can open up this Python notebook.
>>And with all the unique features of the Jupyter environment, I can start coding. Each of these cells can run Python code, and specifically, on the HPE Ezmeral Container Platform team, we've actually built our own in-house line magic commands. These are unique commands that we can use to interact with the underlying infrastructure of the container platform. The first line magic command that I want to mention is this command called percent attachments. When I run this command, I'll get the available training clusters that I can send training jobs to. This specific notebook has pretty much been created for me to iterate and develop a model very quickly. I don't have to use all the resources. I don't have to allocate a full set of GPU boxes to my little Jupyter environment. With the training cluster, I can attach these individual data science notebooks to those training clusters, and the data scientists can utilize those resources as a shared environment.
>>So essentially, the large eight-GPU box can actually be shared; it doesn't have to be allocated to a single data scientist. Moving on.
We have another line magic command; it's called percent percent Python training. This is how we're going to utilize that training cluster. I will prepend the cell with percent percent and the name of the training cluster, and this is going to tell the notebook to send this entire training cell to be trained on the resources of that training cluster. So the data scientists can quickly iterate through a model. They can then format that model and all that code into a large cell and send it off to that training cluster. Because that training cluster is actually located somewhere else, it has no context of what has been done locally in this notebook. So we're going to have to copy everything into one large cell.
>>So as you see here, I'm going to be importing some libraries and I'm going to start defining some helper functions. I'm going to read in my dataset, and with the typical data science modeling life cycle, we're going to have to take in the data and do some data preprocessing. Maybe the data scientists will do this, maybe the data engineer will do this, but they have access to that data. So here, I'm actually going to be reading in the data from the project repository. I'll talk about this a little bit later: with all of the clusters within the container platform, we have access to a project repository that has been set up using the underlying data fabric. So with this, I have some data preprocessing. I'm going to cleanse some of my data where I noticed that maybe something is missing, or some data looks funky.
>>Maybe the data types aren't correct. This will all happen here in these cells. Once that is done, I can print out that the data is done cleaning, and I can start training my model. Here we have to split our dataset into a train/test split so that we have some data for actually training the model and some data to test the model. So I can split my data there.
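The cleaning and splitting steps described above can be sketched in a self-contained way, which is also what a cell sent to a remote training cluster needs to be. This is only an illustration using the standard library: the `clean` and `train_test_split` helpers, the column names, and the synthetic rows are assumptions, and the demo notebook itself is not shown in enough detail to reproduce.

```python
import random


def clean(rows):
    """Drop rows with missing values or non-positive durations."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # something is missing
        if row["duration_s"] <= 0:
            continue  # the value looks wrong
        cleaned.append(row)
    return cleaned


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic rows: ten valid trips plus two that should be filtered out.
rows = [{"distance_km": float(i), "duration_s": 120.0 * i} for i in range(1, 11)]
rows.append({"distance_km": None, "duration_s": 300.0})
rows.append({"distance_km": 2.0, "duration_s": -5.0})

cleaned = clean(rows)
train, test = train_test_split(cleaned)
print("data is done cleaning:", len(cleaned), "rows,", len(train), "train,", len(test), "test")
```

Fixing the shuffle seed keeps the split reproducible between runs, which helps when comparing models across iterations.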
I can create my XGBoost object to start doing my training. XGBoost is a decision-tree-based machine learning algorithm, and I'm going to fit my data to this XGBoost algorithm. Then I'm going to do some prediction. In addition, I'm going to be tracking some of the metrics and printing them out. These are common metrics that data scientists want to see when they train the algorithm.
>>Just to see if the accuracy is being improved, if the loss is being improved, or the mean absolute error, things like that. These are all things data scientists want to see. At the end of this training job, I'm going to be saving the model. I'm going to be saving it back into the project repository, which we will have access to. And at the end, I will print out the end time. So I can execute that cell, and I've already executed that cell, so you'll see all of these print statements happening here: importing the libraries, the training was run, reading in data, et cetera. All of this has been printed out from that training job. And in order to access that and glance through it, we get an output with a unique history URL.
>>So when we send the training job to that training cluster, the training cluster will send back a unique URL, with which we'll use the last line magic command that I want to talk about, called percent logs. Percent logs will parse out that response from the training cluster, and we can track in real time what is happening in that training job. So quickly, we can see that the data scientist has a sandbox environment available to them. They have access to their Git repository. They have access to a project repository from which they can read in some of their data and to which they can save the model. So it's a very quick, interactive environment for the data scientists to do all of their work. And it's all provisioned on the HPE Ezmeral Container Platform.
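The metric tracking and model saving steps of such a training cell can be sketched as follows. To stay runnable with only the standard library, a trivial mean-duration baseline stands in for the XGBoost model used in the demo, and a temporary directory stands in for the project repository path, which the talk does not give. All names here are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def fit_mean_baseline(durations):
    """Stand-in 'model': remembers the mean training duration."""
    return {"mean_duration_s": sum(durations) / len(durations)}


def predict(model, n_rows):
    """Predict the same mean duration for every row."""
    return [model["mean_duration_s"]] * n_rows


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up durations in seconds, for illustration only.
train_durations = [600.0, 900.0, 1200.0]
test_durations = [800.0, 1000.0]

model = fit_mean_baseline(train_durations)
mae = mean_absolute_error(test_durations, predict(model, len(test_durations)))
print(f"mean absolute error: {mae:.1f} s")

# Save the model parameters; a temp directory stands in for shared storage.
model_path = Path(tempfile.mkdtemp()) / "taxi_model.json"
model_path.write_text(json.dumps(model))
restored = json.loads(model_path.read_text())
print("model saved to", model_path)
```

Printing the metric and the save location from inside the cell matters here because, as described, the job runs remotely and its printed output is what comes back to the notebook.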
And it's also abstracted away. So here, I want to mention again that this URL is being surfaced through the container platform.
>>The data scientist doesn't have to interact with that at all. But let's take a step back. This is the day in the life of the data scientist. Now we're going to go backwards into the container platform and walk through how it was all set up for them. Here is my login page to the container platform. I'm going to log in as my user, and this is going to bring me to the view of the ML Ops tenant within the container platform. This is where everything has been set up for me. The data scientist doesn't have to see this if they don't need to, but what I'll walk through now are the topics that I mentioned previously that we would come back to. First is the project repository. This project repository comes with each tenant that is created on the platform.
>>So this is nothing more than a shared, collaborative workspace environment for any data scientist who is allocated to this tenant. They have this POSIX client that can visually see all of their data and all of their code. And this is actually taking a piece of the underlying data fabric and using that for your project repository. You can see here I have some code, I can create and see my scoring script, and I can see the models that have been created within this tenant. So it's pretty much a powerful tool in which you can store your code, store any of your data, and have the ability to read and write from any of your Jupyter environments or any of your created clusters within this tenant. So it's a very cool addition here, with which you can quickly interact with your data.
>>The next thing I want to show is the source control. So here is where you would plug in all of your information for your source control.
And if I edit this, you guys will actually see all the information that I've passed in to configure the source control. On the backend, the container platform will take these credentials and connect the Jupyter notebooks you create within this tenant to that Git repository. So this is the information that I've passed in. If GitHub is not of interest, we also have support for Bitbucket here as well. Next I want to show you guys that we do have these notebook environments. The notebook environment was created here, and you can see that I have a notebook called Terry notebook, and this is all running on the Kubernetes environment within the container platform. So either the data scientists can come here and create their notebook, or their project admin can create the notebook.
>>And all you'd have to do is come here to these notebook endpoints. The container platform will actually map the container to a specific port, so you can just give this link to the data scientists. This link will bring them to their own Jupyter environment, and they can start doing all of their modeling, just as I showed in that previous Jupyter environment. Next I want to show the training cluster. This is the training cluster that was created, to which I can attach my notebook to start utilizing those training resources. And then the last thing I want to show is the deployment cluster. Once that model has been saved, we have a model registry in which we can register the model into the platform. And then the last step is to create a deployment cluster. So here on my screen, I have a deployment cluster called taxi deployment.
>>And then all these serving endpoints have been configured for me, and most importantly, this endpoint model. The deployment cluster actually wraps the trained model with a Flask wrapper and adds a REST endpoint to it.
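As a rough sketch of what calling a model behind a REST endpoint like this looks like from Python, the snippet below builds a JSON POST request with the standard library. The endpoint URL, the feature names, and the `build_prediction_request` helper are hypothetical; the real endpoint and request body format come from the deployment cluster and are not spelled out in the talk. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; the real one is shown in the platform UI.
ENDPOINT = "https://gateway.example.com:10001/taxi-deployment/predict"


def build_prediction_request(endpoint, features):
    """Build a JSON POST request carrying the features for one ride."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical features for a single ride.
features = {
    "pickup_lat": 40.75,
    "pickup_lon": -73.99,
    "hour_of_day": 8,
    "day_of_week": 2,
}

request = build_prediction_request(ENDPOINT, features)
print(request.get_method(), request.full_url)

# To send it against a live endpoint, something like this would be used:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The same request could equally be issued with curl or a GUI client; only the URL, the JSON body, and the content type matter.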
I can operationalize my model by taking this endpoint and creating a curl command, or even a POST request. So here I have my trusty Postman tool, in which I can format a POST request. I've taken that endpoint from the container platform. I've formatted my body right here. These are some of the features that I want to send to that model, and I want to know how long this specific taxi ride, at this location, at this time of day, would take. So I can go ahead and send that request. And then quickly I will get an output: the ride duration will take about 2,600 seconds.
>>So pretty much we've walked through how a data scientist can quickly interact with their notebook. They can train their model. And then coming into the platform, we saw the project repository and we saw the source control. We can register the model within the platform, and then we can quickly operationalize that model with our deployment cluster and have our model up and running and available for inference. So that wraps up the demo. I'm going to pass it back to Doug and Matt and see if they want to come off mute and see if there are any questions. Matt, Doug, you there? Okay.
>>Yeah. Hey, Terry, sorry. Just had some trouble getting off mute there. No, that was an excellent presentation. And I think there are generally some questions that come up when I talk to customers around how integrated into the Kubernetes ecosystem this capability is, and where this sort of Ezmeral starts and the open source technologies, like Kubeflow as an example, begin.
>>Yeah, sure, Matt. So this is one layer up. We have our ML Ops tenant, and this is all running on a piece of a Kubernetes cluster. If I log back out and go into the site admin view, this is where you would see all the Kubernetes clusters being created. And it's actually all abstracted away from the data scientists. They don't have to know Kubernetes.
They just interact with the platform if they want to. But here in the site admin view, I have this Kubernetes dashboard, and here on the left-hand side I have all my Kubernetes sections. If I just add some compute hosts, whether they're VMs or cloud compute hosts like EC2 hosts, we can have these resources abstracted away from us to then create a Kubernetes cluster. So moving on down, I have created this Kubernetes cluster utilizing those resources.
>>So if I go ahead and edit this cluster, you'll actually see that I have these hosts, and it's just a click-and-drop method. I can move different hosts to then configure my Kubernetes cluster. Once my Kubernetes cluster is configured, I can then create a Kubernetes tenant, or in this case, a namespace. Once I have this namespace available, I can then go into that tenant. And as my user, I don't actually see that it is running on Kubernetes. In addition, with our ML Ops tenants, you have the ability to bootstrap Kubeflow. Kubeflow is an open source machine learning framework that runs on Kubernetes, and we have the ability to link that up as well. So, coming back to my ML Ops tenant, I can log in. What I showed is the HPE Ezmeral Container Platform version of ML Ops, but you see here we've also integrated Kubeflow. So that's a nod to HPE's contribution to utilizing open source. It's actually all configured within our platform. So, hopefully...
>>Yeah, actually, Terry, can you hear me? It's Doug. So there were a couple of other questions, actually, about Kubeflow that came in. I wonder whether you could just comment on why we've chosen Kubeflow, because I know there was a question about MLflow instead, and what the differences between MLflow and Kubeflow are.
>>Yeah, sure. So just to reiterate, there are some questions about Kubeflow and I'm just...
>>Yeah, so obviously one of the people watching saw the Kubeflow dashboard there, I guess.
And so couldn't help but get excited about it. But there was another question about MLflow versus Kubeflow and what the difference was between them.
>>Yeah. So with Kubeflow, it's an open source framework that Google has developed. It's a very powerful framework that comes with a lot of other unique tools on Kubernetes. With Kubeflow, you really have the ability to launch other notebooks. You have the ability to utilize different Kubernetes operators, like TensorFlow and PyTorch. You can utilize some of the frameworks within Kubeflow to do training, like Kubeflow Pipelines, which visually allow you to see your training jobs within Kubeflow. It also has a plethora of different serving mechanisms, such as Seldon, for deploying your machine learning models. You have KFServing, you have TF Serving. So Kubeflow is a very powerful tool for data scientists to utilize if they want a full end-to-end open source stack and know how to use Kubernetes. So it's just another way to do your machine learning model development. And, right, with MLflow, it's actually a different piece of the machine learning pipeline. MLflow mainly focuses on model experimentation, comparing different models during the training, and it can be used with Kubeflow.
>>They're complementary, Terry, I think is what you're saying. Sorry, I know we are dramatically running out of time now. So that was a really fantastic demo. Thank you very much indeed.
>>Exactly. Thank you. So yeah, I think that wraps it up. One last thing I want to mention is this slide that I want to show: in case you have any other questions, you can visit hpe.com/ezmeral or hpe.com/containerplatform. And that wraps it up. So thank you, guys.

Published Date : Mar 17 2021


