Today, we are pleased to announce Amazon Elastic Kubernetes Service (EKS) support in Amazon SageMaker HyperPod, purpose-built infrastructure engineered with resilience at its core for foundation model (FM) development. This new capability enables customers to orchestrate HyperPod clusters using EKS, combining the power of Kubernetes with the resilient environment of Amazon SageMaker HyperPod designed for training large models. Amazon SageMaker HyperPod helps you scale efficiently across more than a thousand artificial intelligence (AI) accelerators, reducing training time by up to 40%.
Amazon SageMaker HyperPod now enables customers to manage their clusters using a Kubernetes-based interface. This integration allows seamless switching between Slurm and Amazon EKS to optimize various workloads, including training, fine-tuning, experimentation, and inference. The CloudWatch Observability EKS add-on provides comprehensive monitoring capabilities, offering insights into CPU, network, disk, and other low-level node metrics on a unified dashboard. This enhanced observability extends to cluster-wide resource utilization, node-level metrics, pod-level performance, and container-specific utilization data, facilitating efficient troubleshooting and optimization.
Launched at re:Invent 2023, Amazon SageMaker HyperPod has become a go-to solution for AI startups and enterprises looking to efficiently train and deploy large-scale models. It is compatible with the SageMaker distributed training libraries, which offer Model Parallel and Data Parallel software optimizations that help reduce training time by up to 20%. SageMaker HyperPod automatically detects and repairs or replaces faulty instances, enabling data scientists to train models uninterrupted for weeks or months. This allows data scientists to focus on model development rather than managing infrastructure.
The integration of Amazon EKS with Amazon SageMaker HyperPod takes advantage of Kubernetes, which has become popular for machine learning (ML) workloads due to its scalability and rich open source tooling. Organizations often standardize on Kubernetes to build applications, including those required for generative AI use cases, because it allows reuse of capabilities across environments while meeting compliance and governance standards. Today's announcement enables customers to scale and optimize resource utilization across more than a thousand AI accelerators. This flexibility enhances the developer experience, containerized application management, and dynamic scaling for FM training and inference workloads.
Amazon EKS support in Amazon SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, ensuring uninterrupted training for large-scale or long-running jobs. Job management can be streamlined with the optional HyperPod CLI, designed for Kubernetes environments, though customers can also use their own CLI tools. Integration with Amazon CloudWatch Container Insights provides advanced observability, offering deeper insights into cluster performance, health, and utilization. Additionally, data scientists can use tools like Kubeflow for automated ML workflows. The integration also includes Amazon SageMaker managed MLflow, providing a robust solution for experiment tracking and model management.
At a high level, the cloud administrator creates the Amazon SageMaker HyperPod cluster using the HyperPod cluster API. The cluster is fully managed by the HyperPod service, which removes the undifferentiated heavy lifting involved in building and optimizing ML infrastructure. Amazon EKS is used to orchestrate these HyperPod nodes, similar to how Slurm orchestrates HyperPod nodes, providing customers with a familiar Kubernetes-based administrator experience.
Let's explore how to get started with Amazon EKS support in Amazon SageMaker HyperPod
I start by preparing the scenario, checking the prerequisites, and creating an Amazon EKS cluster with a single AWS CloudFormation stack, following the Amazon SageMaker HyperPod EKS workshop. The stack is configured with VPC and storage resources.
To create and manage Amazon SageMaker HyperPod clusters, I can use either the AWS Management Console or the AWS Command Line Interface (AWS CLI). Using the AWS CLI, I specify my cluster configuration in a JSON file. I choose the Amazon EKS cluster created previously as the orchestrator of the SageMaker HyperPod cluster. Then, I create the cluster worker nodes, which I call "worker-group-1", with a private subnet. I set NodeRecovery to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks I add InstanceStress and InstanceConnectivity to enable deep health checks.
cat > eli-cluster-config.json << EOL
{
    "ClusterName": "example-hp-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 32,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        ....
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "ResilienceConfig": {
        "NodeRecovery": "Automatic"
    }
}
EOL
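Because the heredoc above is unquoted, the shell substitutes variables such as EKS_CLUSTER_ARN and BUCKET_NAME when the file is written, and an unset variable silently becomes an empty string. As a minimal sketch, assuming you want to catch that before writing the file, a small guard could look like the following. The require_vars helper is hypothetical and not part of the AWS CLI.

```shell
# Hypothetical helper (not part of the AWS CLI): fail early if any variable
# referenced by the cluster configuration template is unset or empty.
require_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing required variables:$missing" >&2
    return 1
  fi
  return 0
}

# Usage, before running the cat command that writes the JSON file:
# require_vars EKS_CLUSTER_ARN BUCKET_NAME EXECUTION_ROLE SECURITY_GROUP SUBNET_ID || exit 1
```

The check only confirms that each variable is non-empty. It does not verify that the ARN, bucket, role, security group, or subnet actually exist.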
You can add InstanceStorageConfigs to provision and mount additional Amazon EBS volumes on HyperPod nodes.
To create the cluster using the SageMaker HyperPod APIs, I run the following AWS CLI command:
aws sagemaker create-cluster \
    --cli-input-json file://eli-cluster-config.json
The AWS CLI command returns the ARN of the new HyperPod cluster.
{
"ClusterArn": "arn:aws:sagemaker:us-east-2:ACCOUNT-ID:cluster/wccy5z4n4m49"
}
I then verify the status of the HyperPod cluster in the SageMaker console, waiting until the status changes to InService.
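If you prefer the command line to the console, the same wait can be scripted. The following is a minimal sketch, assuming the AWS CLI is configured for the right account and Region and that the cluster is named example-hp-cluster as in the configuration file. It polls aws sagemaker describe-cluster and reads the ClusterStatus field. The wait_for_cluster function, its polling interval, and its attempt limit are illustrative choices, not part of the service.

```shell
# Poll the cluster status until it reaches InService (success) or Failed.
# Arguments: cluster name, polling interval in seconds (default 60),
# maximum number of attempts (default 60). Illustrative helper only.
wait_for_cluster() {
  cluster_name="$1"
  interval="${2:-60}"
  max_attempts="${3:-60}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Ask SageMaker for the current status; return 2 if the CLI call fails.
    status=$(aws sagemaker describe-cluster \
      --cluster-name "$cluster_name" \
      --query 'ClusterStatus' --output text) || return 2
    echo "attempt $attempt: $status"
    case "$status" in
      InService) return 0 ;;
      Failed) return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for $cluster_name" >&2
  return 3
}

# Usage:
# wait_for_cluster example-hp-cluster 60 60
```

The function returns 0 when the cluster is InService, 1 when it reports Failed, 2 when the CLI call itself fails, and 3 on timeout, so it can gate later steps in a script.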
I can also monitor cluster performance and health metrics using Amazon CloudWatch Container Insights.
Things to know
Here are some key things you should know about Amazon EKS support in Amazon SageMaker HyperPod:
Resilient environment – This integration provides a more resilient training environment with deep health checks, automated node recovery, and job auto-resume. SageMaker HyperPod automatically detects, diagnoses, and recovers from faults, allowing you to continually train foundation models for weeks or months without disruption. This can reduce training time by up to 40%.
Enhanced GPU observability – Amazon CloudWatch Container Insights provides detailed metrics and logs for your containerized applications and microservices. This enables comprehensive monitoring of cluster performance and health.
Scientist-friendly tooling – This launch includes a custom HyperPod CLI for job management, Kubeflow Training Operators for distributed training, Kueue for scheduling, and integration with SageMaker managed MLflow for experiment tracking. It also works with the SageMaker distributed training libraries, which provide Model Parallel and Data Parallel optimizations to significantly reduce training time. These libraries, combined with job auto-resume, enable efficient and uninterrupted training of large models.
Flexible resource utilization – This integration enhances the developer experience and scalability for FM workloads. Data scientists can efficiently share compute capacity across training and inference tasks. You can use your existing Amazon EKS clusters or create and attach new ones to HyperPod compute, and bring your own tools for job submission, queuing, and monitoring.
To get started with Amazon SageMaker HyperPod on Amazon EKS, you can explore resources such as the SageMaker HyperPod EKS Workshop, the aws-do-hyperpod project, and the awsome-distributed-training project. This release is generally available in the AWS Regions where Amazon SageMaker HyperPod is available, except Europe (London). For pricing information, visit the Amazon SageMaker Pricing page.
This blog post was a collaborative effort. I would like to thank Manoj Ravi, Adhesh Garg, Tomonori Shimomura, Alex Iankoulski, Anoop Saha, and the entire team for their significant contributions in compiling and refining the information presented here. Their collective expertise was crucial in creating this comprehensive article.
– Eli.