Saturday, January 18, 2025

Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes


Today, we're announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs, such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.

At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce FM training time by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the accelerated compute resources required for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.

SageMaker HyperPod recipes include a training stack tested by AWS, removing the tedious work of experimenting with different model configurations and eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run production workloads on SageMaker HyperPod or SageMaker training jobs.

SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.

You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.

After cloning the repository, you need to edit the recipe config.yaml file to specify the model and cluster type.

$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml

The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, logs, and so on.

defaults:
- cluster: slurm # support: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs etc.
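
Before launching, it can help to sanity-check these launcher settings. The following is a minimal sketch, not part of the recipes repository: the function name check_launcher_config and the flattened dictionary layout are assumptions made for illustration, and the rule that base_results_dir must be non-empty is also an assumption. The accepted cluster values come from the comment in the file above.

```python
# Hypothetical pre-flight check for the launcher settings shown above.
# The dict layout is a flattened stand-in for config.yaml (an assumption),
# not the structure the launcher itself parses.

SUPPORTED_CLUSTERS = {"slurm", "k8s", "sm_jobs"}  # values listed in the config comment


def check_launcher_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if cfg.get("cluster") not in SUPPORTED_CLUSTERS:
        problems.append(f"cluster must be one of {sorted(SUPPORTED_CLUSTERS)}")
    # SageMaker instance type names carry an "ml." prefix, e.g. ml.p5.48xlarge.
    if not str(cfg.get("instance_type", "")).startswith("ml."):
        problems.append("instance_type should look like 'ml.p5.48xlarge'")
    # Assumption: a results location is needed for checkpoints and logs.
    if not cfg.get("base_results_dir"):
        problems.append("base_results_dir is empty")
    return problems


if __name__ == "__main__":
    example = {
        "cluster": "slurm",
        "instance_type": "ml.p5.48xlarge",
        "base_results_dir": "/fsx/results",  # hypothetical path
    }
    print(check_launcher_config(example))
```

An empty list means the three settings look plausible; it does not guarantee that the cluster or instance type is actually available in your account.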

You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.

run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True 
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # different model-specific params
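
Two values in this configuration are easy to misread: the total accelerator count and the time limit. The sketch below works them out in plain Python. It assumes that the per-node accelerator count (8) multiplies with the node count (2), and that the time limit follows Slurm's days-hours:minutes:seconds form. The helper names are hypothetical and not part of the recipes.

```python
from datetime import timedelta


def total_accelerators(devices_per_node: int, num_nodes: int) -> int:
    """Total accelerators in the job: accelerators per node times number of nodes."""
    return devices_per_node * num_nodes


def parse_time_limit(value: str) -> timedelta:
    """Parse a 'D-HH:MM:SS' string (assumed Slurm-style) into a timedelta."""
    days, _, clock = value.partition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    print(total_accelerators(8, 2))        # 16
    print(parse_time_limit("6-00:00:00"))  # 6 days, 0:00:00
```

Under these assumptions, the sample recipe asks for 16 GPUs and allows the job to run for up to six days.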

To run this recipe on SageMaker HyperPod with Slurm, you must first prepare the SageMaker HyperPod cluster by following the cluster setup instructions.

Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to inspect the content before starting the training job.

$ python3 main.py --config-path recipes_collection --config-name=config

After training completes, the trained model is automatically saved to your assigned data location.

To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster, and use the HyperPod Command Line Interface (CLI) to run the recipe.

$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr..amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "",
  "recipes.model.data.val_dir": ""
}'
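
Hand-writing a JSON string inside shell quotes is error-prone, so you might prefer to assemble the command programmatically. The sketch below builds the same kind of command line with Python's standard json and shlex modules. It uses only the flags shown above; the function name build_start_job_command and the dataset paths are hypothetical, and the container override is left out of the example.

```python
import json
import shlex


def build_start_job_command(recipe: str, pvc: str, overrides: dict) -> str:
    """Return a shell-safe 'hyperpod start-job' command line as a single string."""
    args = [
        "hyperpod", "start-job",
        "--recipe", recipe,
        "--persistent-volume-claims", pvc,
        # json.dumps always emits valid JSON (no trailing commas).
        "--override-parameters", json.dumps(overrides),
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


if __name__ == "__main__":
    overrides = {
        "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
        "recipes.exp_manager.exp_dir": "/data/",
        "cluster": "k8s",
        "cluster_type": "k8s",
        "recipes.model.data.train_dir": "/data/train",  # hypothetical path
        "recipes.model.data.val_dir": "/data/val",      # hypothetical path
    }
    print(build_start_job_command(
        "fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
        "fsx-claim:data",
        overrides,
    ))
```

The printed string can be pasted into a terminal, or the argument list can be passed to subprocess.run directly, which avoids shell quoting altogether.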

You can also run recipes on SageMaker training jobs using the SageMaker Python SDK. The following example runs PyTorch training scripts on SageMaker training jobs with overridden training recipes.

...
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}
pytorch_estimator = PyTorch(
           output_path=,
           base_job_name=f"llama-recipe",
           role=,
           instance_type="ml.p5.48xlarge",
           training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
           recipe_overrides=recipe_overrides,
           sagemaker_session=sagemaker_session,
           tensorboard_output_config=tensorboard_output_config,
)
...
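
To see what a nested override does to a recipe, it can help to model it as a recursive dictionary merge. The sketch below is an illustration only: it assumes that overrides are merged key by key, so that sibling settings not mentioned in the override keep their recipe values. It is not the SageMaker Python SDK's implementation, and apply_overrides is a hypothetical helper.

```python
import copy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a new dict where values from overrides replace those in base, recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge one level deeper.
            merged[key] = apply_overrides(merged[key], value)
        else:
            # Scalars and new keys replace or extend the base.
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    recipe = {
        "run": {"name": "llama-405b", "results_dir": "/fsx/results/llama-405b"},
        "trainer": {"devices": 8, "num_nodes": 2},
    }
    overrides = {"run": {"results_dir": "/opt/ml/model"}}
    merged = apply_overrides(recipe, overrides)
    print(merged["run"])  # {'name': 'llama-405b', 'results_dir': '/opt/ml/model'}
```

Under this assumption, overriding run.results_dir leaves run.name and the trainer settings untouched, which is why the override dictionary only needs to list the values that differ from the recipe.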

As training progresses, the model checkpoints are stored in Amazon Simple Storage Service (Amazon S3) with the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.

Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.

Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

Channy


