Ian Colle, AWS | SuperComputing 22
(lively music) >> Good morning. Welcome back to theCUBE's coverage at Supercomputing Conference 2022, live here in Dallas. I'm Dave Nicholson with my co-host Paul Gillin. So far so good, Paul? It's been a fascinating morning Three days in, and a fascinating guest, Ian from AWS. Welcome. >> Thanks, Dave. >> What are we going to talk about? Batch computing, HPC. >> We've got a lot, let's get started. Let's dive right in. >> Yeah, we've got a lot to talk about. I mean, first thing is we recently announced our batch support for EKS. EKS is our Kubernetes, managed Kubernetes offering at AWS. And so batch computing is still a large portion of HPC workloads. While the interactive component is growing, the vast majority of systems are just kind of fire and forget, and we want to run thousands and thousands of nodes in parallel. We want to scale out those workloads. And what's unique about our AWS batch offering, is that we can dynamically scale, based upon the queue depth. And so customers can go from seemingly nothing up to thousands of nodes, and while they're executing their work they're only paying for the instances while they're working. And then as the queue depth starts to drop and the number of jobs waiting in the queue starts to drop, then we start to dynamically scale down those resources. And so it's extremely powerful. We see lots of distributed machine learning, autonomous vehicle simulation, and traditional HPC workloads taking advantage of AWS Batch. >> So when you have a Kubernetes cluster does it have to be located in the same region as the HPC cluster that's going to be doing the batch processing, or does the nature of batch processing mean, in theory, you can move something from here to somewhere relatively far away to do the batch processing? How does that work? 'Cause look, we're walking around here and people are talking about lengths of cables in order to improve performance. So what does that look like when you peel back the cover and you look at it physically, not just logically, AWS is everywhere, but physically, what does that look like? >> Oh, physically, for us, it depends on what the customer's looking for. We have workflows that are all entirely within a single region. And so where they could have a portion of say the traditional HPC workflow, is within that region as well as the batch, and they're saving off the results, say to a shared storage file system like our Amazon FSx for Lustre, or maybe aging that back to an S3 object storage for a little lower cost storage solution. Or you can have customers that have a kind of a multi-region orchestration layer to where they say, "You know what? "I've got a portion of my workflow that occurs "over on the other side of the country "and I replicate my data between the East Coast "and the West Coast just based upon business needs. "And I want to have that available to customers over there. "And so I'll do a portion of it in the East Coast "a portion of it in the West Coast." Or you can think of that even globally. It really depends upon the customer's architecture. >> So is the intersection of Kubernetes with HPC, is this relatively new? I know you're saying you're, you're announcing it. >> It really is. I think we've seen a growing perspective. I mean, Kubernetes has been a long time kind of eating everything, right, in the enterprise space? And now a lot of CIOs in the industrial space are saying, "Why am I using one orchestration layer "to manage my HPC infrastructure and another one "to manage my enterprise infrastructure?" And so there's a growing appreciation that, you know what, why don't we just consolidate on one? And so that's where we've seen a growth of Kubernetes infrastructure and our own managed Kubernetes EKS on AWS. >> Last month you announced a general availability of Trainium, of a chip that's optimized for AI training. Talk about what's special about that chip or what is is customized to the training workloads. >> Yeah, what's unique about the Trainium, is you'll you'll see 40% price performance over any other GPU available in the AWS cloud. And so we've really geared it to be that most price performance of options for our customers. And that's what we like about the silicon team, that we're part of that Annaperna acquisition, is because it really has enabled us to have this differentiation and to not just be innovating at the software level but the entire stack. That Annaperna Labs team develops our network cards, they develop our ARM cards, they developed this Trainium chip. And so that silicon innovation has become a core part of our differentiator from other vendors. And what Trainium allows you to do is perform similar workloads, just at a lower price performance. >> And you also have a chip several years older, called Inferentia- >> Um-hmm. >> Which is for inferencing. What is the difference between, I mean, when would a customer use one versus the other? How would you move the workload? >> What we've seen is customers traditionally have looked for a certain class of machine, more of a compute type that is not as accelerated or as heavy as you would need for Trainium for their inference portion of their workload. So when they do that training they want the really beefy machines that can grind through a lot of data. But when you're doing the inference, it's a little lighter weight. And so it's a different class of machine. And so that's why we've got those two different product lines with the Inferentia being there to support those inference portions of their workflow and the Trainium to be that kind of heavy duty training work. >> And then you advise them on how to migrate their workloads from one to the other? And once the model is trained would they switch to an Inferentia-based instance? >> Definitely, definitely. We help them work through what does that design of that workflow look like? And some customers are very comfortable doing self-service and just kind of building it on their own. Other customers look for a more professional services engagement to say like, "Hey, can you come in and help me work "through how I might modify my workflow to "take full advantage of these resources?" >> The HPC world has been somewhat slower than commercial computing to migrate to the cloud because- >> You're very polite. (panelists all laughing) >> Latency issues, they want to control the workload, they want to, I mean there are even issues with moving large amounts of data back and forth. What do you say to them? I mean what's the argument for ditching the on-prem supercomputer and going all-in on AWS? >> Well, I mean, to be fair, I started at AWS five years ago. And I can tell you when I showed up at Supercomputing, even though I'd been part of this community for many years, they said, "What is AWS doing at Supercomputing?" I know you care, wait, it's Amazon Web Services. You care about the web, can you actually handle supercomputing workloads? Now the thing that very few people appreciated is that yes, we could. Even at that time in 2017, we had customers that were performing HPC workloads. Now that being said, there were some real limitations on what we could perform. And over those past five years, as we've grown as a company, we've started to really eliminate those frictions for customers to migrate their HPC workloads to the AWS cloud. When I started in 2017, we didn't have our elastic fabric adapter, our low-latency interconnect. So customers were stuck with standard TCP/IP. So for their highly demanding open MPI workloads, we just didn't have the latencies to support them. So the jobs didn't run as efficiently as they could. We didn't have Amazon FSx for Lustre, our managed lustre offering for high performant, POSIX-compliant file system, which is kind of the key to a large portion of HPC workloads is you have to have a high-performance file system. We didn't even, I mean, we had about 25 gigs of networking when I started. Now you look at, with our accelerated instances, we've got 400 gigs of networking. So we've really continued to grow across that spectrum and to eliminate a lot of those really, frictions to adoption. I mean, one of the key ones, we had a open source toolkit that was jointly developed by Intel and AWS called CFN Cluster that customers were using to even instantiate their clusters. So, and now we've migrated that all the way to a fully functional supported service at AWS called AWS Parallel Cluster. And so you've seen over those past five years we have had to develop, we've had to grow, we've had to earn the trust of these customers and say come run your workloads on us and we will demonstrate that we can meet your demanding requirements. And at the same time, there's been, I'd say, more of a cultural acceptance. People have gone away from the, again, five years ago, to what are you doing walking around the show, to say, "Okay, I'm not sure I get it. "I need to look at it. "I, okay, I, now, oh, it needs to be a part "of my architecture but the standard questions, "is it secure? "Is it price performant? "How does it compare to my on-prem?" And really culturally, a lot of it is, just getting IT administrators used to, we're not eliminating a whole field, right? We're just upskilling the people that used to rack and stack actual hardware, to now you're learning AWS services and how to operate within that environment. And it's still key to have those people that are really supporting these infrastructures. And so I'd say it's a little bit of a combination of cultural shift over the past five years, to see that cloud is a super important part of HPC workloads, and part of it's been us meeting the the market segment of where we needed to with innovating both at the hardware level and at the software level, which we're going to continue to do. >> You do have an on-prem story though. I mean, you have outposts. We don't hear a lot of talk about outposts lately, but these innovations, like Inferentia, like Trainium, like the networking innovation you're talking about, are these going to make their way into outposts as well? Will that essentially become this supercomputing solution for customers who want to stay on-prem? >> Well, we'll see what the future lies, but we believe that we've got the, as you noted, we've got the hardware, we've got the network, we've got the storage. All those put together gives you a a high-performance computer, right? And whether you want it to be redundant in your local data center or you want it to be accessible via APIs from the AWS cloud, we want to provide that service to you. >> So to be clear, that's not that's not available now, but that is something that could be made available? >> Outposts are available right now, that have this the services that you need. >> All these capabilities? >> Often a move to cloud, an impetus behind it comes from the highest levels in an organization. They're looking at the difference between OpEx versus CapEx. CapEx for a large HPC environment, can be very, very, very high. Are these HPC clusters consumed as an operational expense? Are you essentially renting time, and then a fundamental question, are these multi-tenant environments? Or when you're referring to batches being run in HPC, are these dedicated HPC environments for customers who are running batches against them? When you think about batches, you think of, there are times when batches are being run and there are times when they're not being run. So that would sort of conjure, in the imagination, multi-tenancy, what does that look like? >> Definitely, and that's been, let me start with your second part first is- >> Yeah. That's been a a core area within AWS is we do not see as, okay we're going to, we're going to carve out this super computer and then we're going to allocate that to you. We are going to dynamically allocate multi-tenant resources to you to perform the workloads you need. And especially with the batch environment, we're going to spin up containers on those, and then as the workloads complete we're going to turn those resources over to where they can be utilized by other customers. And so that's where the batch computing component really is powerful, because as you say, you're releasing resources from workloads that you're done with. I can use those for another portion of the workflow for other work. >> Okay, so it makes a huge difference, yeah. >> You mentioned, that five years ago, people couldn't quite believe that AWS was at this conference. Now you've got a booth right out in the center of the action. What kind of questions are you getting? What are people telling you? >> Well, I love being on the show floor. This is like my favorite part is talking to customers and hearing one, what do they love, what do they want more of? Two, what do they wish we were doing that we're not currently doing? And three, what are the friction points that are still exist that, like, how can I make their lives easier? And what we're hearing is, "Can you help me migrate my workloads to the cloud? "Can you give me the information that I need, "both from a price for performance, "for an operational support model, "and really help me be an internal advocate "within my environment to explain "how my resources can be operated proficiently "within the AWS cloud." And a lot of times it's, let's just take your application a subset of your applications and let's benchmark 'em. And really that, AWS, one of the key things is we are a data-driven environment. And so when you take that data and you can help a customer say like, "Let's just not look at hypothetical, "at synthetic benchmarks, let's take "actually the LS-DYNA code that you're running, perhaps. "Let's take the OpenFOAM code that you're running, "that you're running currently "in your on-premises workloads, "and let's run it on AWS cloud "and let's see how it performs." And then we can take that back to your to the decision makers and say, okay, here's the price for performance on AWS, here's what we're currently doing on-premises, how do we think about that? And then that also ties into your earlier question about CapEx versus OpEx. We have models where actual, you can capitalize a longer-term purchase at AWS. So it doesn't have to be, I mean, depending upon the accounting models you want to use, we do have a majority of customers that will stay with that OpEx model, and they like that flexibility of saying, "Okay, spend as you go." We need to have true ups, and make sure that they have insight into what they're doing. I think one of the boogeyman is that, oh, I'm going to spend all my money and I'm not going to know what's available. And so we want to provide the, the cost visibility, the cost controls, to where you feel like, as an HPC administrator you have insight into what your customers are doing and that you have control over that. And so once you kind of take away some of those fears and and give them the information that they need, what you start to see too is, you know what, we really didn't have a lot of those cost visibility and controls with our on-premises hardware. And we've had some customers tell us we had one portion of the workload where this work center was spending thousands of dollars a day. And we went back to them and said, "Hey, we started to show this, "what you were spending on-premises." They went, "Oh, I didn't realize that." And so I think that's part of a cultural thing that, at an HPC, the question was, well on-premises is free. How do you compete with free? And so we need to really change that culturally, to where people see there is no free lunch. You're paying for the resources whether it's on-premises or in the cloud. >> Data scientists don't worry about budgets. >> Wait, on-premises is free? Paul mentioned something that reminded me, you said you were here in 2017, people said AWS, web, what are you even doing here? Now in 2022, you're talking in terms of migrating to cloud. Paul mentioned outposts, let's say that a customer says, "Hey, I'd like you to put "in a thousand-node cluster in this data center "that I happen to own, but from my perspective, "I want to interact with it just like it's "in your data center." In other words, the location doesn't matter. My experience is identical to interacting with AWS in an AWS data center, in a CoLo that works with AWS, but instead it's my physical data center. When we're tracking the percentage of IT that's that is on-prem versus off-prem. What is that? Is that, what I just described, is that cloud? And in five years are you no longer going to be talking about migrating to cloud because people go, "What do you mean migrating to cloud? "What do you even talking about? "What difference does it make?" It's either something that AWS is offering or it's something that someone else is offering. Do you think we'll be at that point in five years, where in this world of virtualization and abstraction, you talked about Kubernetes, we should be there already, thinking in terms of it doesn't matter as long as it meets latency and sovereignty requirements. So that, your prediction, we're all about insights and supercomputing- >> My prediction- >> In five years, will you still be talking about migrating to cloud or will that be something from the past? >> In five years, I still think there will be a component. I think the majority of the assumption will be that things are cloud-native and you start in the cloud and that there are perhaps, an aspect of that, that will be interacting with some sort of an edge device or some sort of an on-premises device. And we hear more and more customers that are saying, "Okay, I can see the future, "I can see that I'm shrinking my footprint." And, you can see them still saying, "I'm not sure how small that beachhead will be, "but right now I want to at least say "that I'm going to operate in that hybrid environment." And so I'd say, again, the pace of this community, I'd say five years we're still going to be talking about migrations, but I'd say the vast majority will be a cloud-native, cloud-first environment. And how do you classify that? That outpost sitting in someone's data center? I'd say we'd still, at least I'll leave that up to the analysts, but I think it would probably come down as cloud spend. >> Great place to end. Ian, you and I now officially have a bet. In five years we're going to come back. My contention is, no we're not going to be talking about it anymore. >> Okay. >> And kids in college are going to be like, "What do you mean cloud, it's all IT, it's all IT." And they won't remember this whole phase of moving to cloud and back and forth. With that, join us in five years to see the result of this mega-bet between Ian and Dave. I'm Dave Nicholson with theCUBE, here at Supercomputing Conference 2022, day three of our coverage with my co-host Paul Gillin. Thanks again for joining us. Stay tuned, after this short break, we'll be back with more action. (lively music)
SUMMARY :
Welcome back to theCUBE's coverage What are we going to talk about? Let's dive right in. in the queue starts to drop, does it have to be of say the traditional HPC workflow, So is the intersection of Kubernetes And now a lot of CIOs in the to the training workloads. And what Trainium allows you What is the difference between, to be that kind of heavy to say like, "Hey, can you You're very polite. to control the workload, to what are you doing I mean, you have outposts. And whether you want it to be redundant that have this the services that you need. Often a move to cloud, to you to perform the workloads you need. Okay, so it makes a What kind of questions are you getting? the cost controls, to where you feel like, And in five years are you no And so I'd say, again, the not going to be talking of moving to cloud and back and forth.
SENTIMENT ANALYSIS :
ENTITIES
Entity | Category | Confidence |
---|---|---|
Ian | PERSON | 0.99+ |
Paul | PERSON | 0.99+ |
Dave Nicholson | PERSON | 0.99+ |
Paul Gillin | PERSON | 0.99+ |
Dave | PERSON | 0.99+ |
AWS | ORGANIZATION | 0.99+ |
400 gigs | QUANTITY | 0.99+ |
2017 | DATE | 0.99+ |
Ian Colle | PERSON | 0.99+ |
thousands | QUANTITY | 0.99+ |
Dallas | LOCATION | 0.99+ |
40% | QUANTITY | 0.99+ |
Amazon Web Services | ORGANIZATION | 0.99+ |
Amazon | ORGANIZATION | 0.99+ |
2022 | DATE | 0.99+ |
Annaperna | ORGANIZATION | 0.99+ |
second part | QUANTITY | 0.99+ |
five years | QUANTITY | 0.99+ |
Last month | DATE | 0.99+ |
Intel | ORGANIZATION | 0.99+ |
five years ago | DATE | 0.98+ |
five | QUANTITY | 0.98+ |
Two | QUANTITY | 0.98+ |
Supercomputing | ORGANIZATION | 0.98+ |
Lustre | ORGANIZATION | 0.97+ |
Annaperna Labs | ORGANIZATION | 0.97+ |
Trainium | ORGANIZATION | 0.97+ |
five years | QUANTITY | 0.96+ |
one | QUANTITY | 0.96+ |
OpEx | TITLE | 0.96+ |
both | QUANTITY | 0.96+ |
first thing | QUANTITY | 0.96+ |
Supercomputing Conference | EVENT | 0.96+ |
first | QUANTITY | 0.96+ |
West Coast | LOCATION | 0.96+ |
thousands of dollars a day | QUANTITY | 0.96+ |
Supercomputing Conference 2022 | EVENT | 0.95+ |
CapEx | TITLE | 0.94+ |
three | QUANTITY | 0.94+ |
theCUBE | ORGANIZATION | 0.92+ |
East Coast | LOCATION | 0.91+ |
single region | QUANTITY | 0.91+ |
years | QUANTITY | 0.91+ |
thousands of nodes | QUANTITY | 0.88+ |
Parallel Cluster | TITLE | 0.87+ |
about 25 gigs | QUANTITY | 0.87+ |