
Develop and test AWS Glue 5.0 jobs locally using a Docker container


AWS Glue is a data integration service that lets you process and integrate data coming from different data sources at scale. AWS Glue 5.0, the latest version of AWS Glue for Apache Spark jobs, provides a performance-optimized Apache Spark 3.5 runtime for batch and stream processing. With AWS Glue 5.0, you get improved performance, enhanced security, support for the next generation of Amazon SageMaker, and more. AWS Glue 5.0 enables you to develop, run, and scale your data integration workloads and get insights faster.

AWS Glue accommodates various development preferences through multiple job creation approaches. For developers who prefer coding directly, Python or Scala development is available using the AWS Glue ETL library.

Building production-ready data platforms requires solid development processes and continuous integration and delivery (CI/CD) pipelines. To support diverse development needs, whether on local machines, in Docker containers on Amazon Elastic Compute Cloud (Amazon EC2), or in other environments, AWS provides an official AWS Glue Docker image through the Amazon ECR Public Gallery. The image lets developers work efficiently in their preferred environment while using the AWS Glue ETL library.

In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post is an updated version of the post Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container, and it uses AWS Glue 5.0.

Available Docker images

The following Docker image is available in the Amazon ECR Public Gallery:

  • AWS Glue version 5.0: public.ecr.aws/glue/aws-glue-libs:5

The AWS Glue Docker images are compatible with both x86_64 and arm64.

In this post, we use public.ecr.aws/glue/aws-glue-libs:5 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue 5.0 Spark jobs. The image includes, among other components, the AWS Glue ETL library and the Apache Spark runtime.

To set up your container, pull the image from the ECR Public Gallery and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:

  • spark-submit
  • REPL shell (pyspark)
  • pytest
  • Visible Studio Code

Prerequisites

Before starting, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host where Docker runs.

Configure AWS credentials

To allow AWS API calls from the container, configure your AWS credentials with the next steps:

  1. Create an AWS named profile.
  2. Open cmd on Windows or a terminal on Mac/Linux, and run the following command:
PROFILE_NAME="profile_name"

In the following sections, we use this AWS named profile.
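
If you want to confirm that the named profile resolves to valid credentials before mounting it into the container, a quick check such as the following can help. This is a minimal sketch that assumes boto3 is installed on your host and that profile_name matches the profile you created:

import boto3

# Resolve the named profile locally and print the caller identity (illustrative check)
session = boto3.Session(profile_name="profile_name")
identity = session.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])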

Pull the image from the ECR Public Gallery

If you are running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers before pulling the image.

Run the following command to pull the image from the ECR Public Gallery:

docker pull public.ecr.aws/glue/aws-glue-libs:5

Run the container

Now you can run a container using this image. You can choose any of the following methods based on your requirements.

spark-submit

You can run an AWS Glue job script by running the spark-submit command on the container.

Write your job script (sample.py in the following example) and save it under the /local_path_to_workspace/src/ directory using the following commands:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/src
$ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}

These variables are used in the docker run command below. The sample code (sample.py) used in the spark-submit command is included in the appendix at the end of this post.

Run the following command to run the spark-submit command on the container to submit a new Spark application:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_spark_submit \
    public.ecr.aws/glue/aws-glue-libs:5 \
    spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME

REPL shell (pyspark)

You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container to start the REPL shell:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

You will see the following output:

Python 3.11.6 (main, Jan  9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.2-amzn-1
      /_/

Using Python version 3.11.6 (main, Jan  9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740643079929).
SparkSession available as 'spark'.
>>> 

With this REPL shell, you can code and test interactively.
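
For example, you can paste the following snippet into the REPL to read the sample dataset used elsewhere in this post. It mirrors the read_json logic from the sample script in Appendix A and assumes your profile has read access to the public awsglue-datasets bucket:

>>> from awsglue.context import GlueContext
>>> glue_context = GlueContext(sc)  # reuse the SparkContext created by the REPL
>>> dyf = glue_context.create_dynamic_frame.from_options(
...     connection_type="s3",
...     connection_options={"paths": ["s3://awsglue-datasets/examples/us-legislators/all/persons.json"], "recurse": True},
...     format="json",
... )
>>> dyf.printSchema()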

Pytest

For unit testing, you can use pytest for AWS Glue Spark job scripts.

Run the following commands for preparation:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}

Now let's invoke pytest using docker run:

$ docker run -i --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c "python3 -m pytest --disable-warnings"

When pytest finishes running the unit tests, your output will look something like the following:

============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/hadoop/workspace
plugins: integration-mark-0.2.0
collected 1 item

tests/test_sample.py .                                                   [100%]

======================== 1 passed, 1 warning in 34.28s =========================

Visual Studio Code

To set up the container with Visual Studio Code, complete the following steps:

  1. Install Visual Studio Code.
  2. Install Python.
  3. Install Dev Containers.
  4. Open the workspace folder in Visual Studio Code.
  5. Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac).
  6. Enter Preferences: Open Workspace Settings (JSON).
  7. Press Enter.
  8. Enter the following JSON and save it:
{
    "python.defaultInterpreterPath": "/usr/bin/python3.11",
    "python.analysis.extraPaths": [
        "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip:/usr/lib/spark/python/:/usr/lib/spark/python/lib/",
    ]
}

Now you're ready to set up the container.

  1. Run the Docker container:
$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

  2. Start Visual Studio Code.
  3. Choose Remote Explorer in the navigation pane.
  4. Choose the container public.ecr.aws/glue/aws-glue-libs:5 (right-click) and choose Attach in Current Window.

  5. If the following dialog appears, choose I understand.

  6. Open /home/hadoop/workspace/.

  7. Create an AWS Glue PySpark script and choose Run.

You should see the successful run of the AWS Glue PySpark script.
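
If you just want a minimal script for this quick check, a short sketch like the following is enough. It only creates a GlueContext and prints the Spark version; the full sample job script is provided in Appendix A:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Minimal smoke test: create a GlueContext and print the Spark version it runs on
glue_context = GlueContext(SparkContext.getOrCreate())
print("Spark version:", glue_context.spark_session.version)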

Changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images

The following are the major changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images:

  • In AWS Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from AWS Glue 4.0, where there was one image for batch and another for streaming.
  • In AWS Glue 5.0, the default user name of the container is hadoop. In AWS Glue 4.0, the default user name was glue_user.
  • In AWS Glue 5.0, several additional libraries, including JupyterLab and Livy, have been removed from the image. You can install them manually.
  • In AWS Glue 5.0, all of the Iceberg, Hudi, and Delta Lake libraries are provisioned by default, and the environment variable DATALAKE_FORMATS is no longer needed. Through AWS Glue 4.0, the environment variable DATALAKE_FORMATS was used to specify which table format to load.

The preceding list is specific to the Docker image. To learn more about AWS Glue 5.0 updates, see Introducing AWS Glue 5.0 for Apache Spark and Migrate AWS Glue for Spark jobs to AWS Glue version 5.0.

Considerations

Keep in mind that some AWS Glue job features are not supported when you use the AWS Glue container image to develop job scripts locally. For example, the customJdbcDriverS3Path connection option can't be used to import a custom JDBC driver from Amazon S3 (see Appendix B).

Conclusion

In this post, we explored how AWS Glue 5.0 Docker images provide a flexible foundation for developing and testing AWS Glue job scripts in your preferred environment. These images, readily available in the Amazon ECR Public Gallery, speed up the development process by offering a consistent and portable environment for AWS Glue development.

For more information about how to build an end-to-end development pipeline, see End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue. We encourage you to explore these capabilities and share your experiences with the AWS community.


Appendix A: Sample AWS Glue job codes

This appendix provides scripts you can use as sample AWS Glue job code for testing purposes. You can use any of them in the tutorial.

The following sample.py code uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM). You need to grant the IAM managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows you to make ListBucket and GetObject API calls for the S3 path.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "test"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format="json"
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()

The following test_sample.py code is a sample unit test for sample.py:

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield context

    job.commit()
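
The test function itself is not shown above. A minimal sketch that exercises sample.read_json through the fixture could look like the following; the dataset path matches Appendix A, and the assertion is intentionally loose because the exact row count depends on the dataset:

def test_read_json(glue_context):
    dyf = sample.read_json(
        glue_context,
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
    )
    # Loose assertion for illustration; replace with the exact count you expect
    assert dyf.toDF().count() > 0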

Appendix B: Add JDBC drivers and Java libraries

To add a JDBC driver that is not currently available in the container, you can create a new directory under your workspace with the JAR files you need and mount the directory to /opt/spark/jars/ in the docker run command. JAR files placed in /opt/spark/jars/ within the container are automatically added to the Spark classpath and will be available for use during the job run.

For example, you can use the following docker run command to add JDBC driver JAR files to a PySpark REPL shell:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $WORKSPACE_LOCATION/jars/:/opt/spark/jars/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jdbc \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

As highlighted earlier, the customJdbcDriverS3Path connection option can't be used to import a custom JDBC driver from Amazon S3 in AWS Glue container images.
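
After the container starts with the driver JAR mounted, you can reference the driver from the REPL through Spark's standard JDBC reader. The following is a rough sketch only; the URL, driver class name, table, and credentials are placeholders you need to replace with your own database settings:

# Placeholder connection details for illustration; replace them with your own
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://your-db-host:5432/your_db") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "public.your_table") \
    .option("user", "your_user") \
    .option("password", "your_password") \
    .load()
df.show(5)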

Appendix C: Add Livy and JupyterLab

The AWS Glue 5.0 container image doesn't have Livy installed by default. You can create a new container image that extends the AWS Glue 5.0 container image as its base. The following Dockerfile demonstrates how you can extend the Docker image to include additional components you need to enhance your development and test experience.

To get started, create a directory on your workstation and place the Dockerfile.livy_jupyter file in the directory:

$ mkdir -p $WORKSPACE_LOCATION/jupyterlab/
$ cd $WORKSPACE_LOCATION/jupyterlab/
$ vim Dockerfile.livy_jupyter

The following code is Dockerfile.livy_jupyter:

FROM public.ecr.aws/glue/aws-glue-libs:5 AS glue-base

ENV LIVY_SERVER_JAVA_OPTS="--add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED"

# Download Livy
ADD --chown=hadoop:hadoop https://dlcdn.apache.org/incubator/livy/0.8.0-incubating/apache-livy-0.8.0-incubating_2.12-bin.zip ./

# Install and configure Livy
RUN unzip apache-livy-0.8.0-incubating_2.12-bin.zip && \
    rm apache-livy-0.8.0-incubating_2.12-bin.zip && \
    mv apache-livy-0.8.0-incubating_2.12-bin livy && \
    mkdir -p livy/logs

RUN cat <<EOF >> livy/conf/livy.conf
livy.server.host = 0.0.0.0
livy.server.port = 8998
livy.spark.master = local
livy.repl.enable-hive-context = true
livy.spark.scala-version = 2.12
EOF

RUN cat <<EOF >> livy/conf/log4j.properties
log4j.rootCategory=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.eclipse.jetty=WARN
EOF

# Switch to the root user temporarily to install dev dependency packages
USER root
RUN dnf update -y && dnf install -y krb5-devel gcc python3.11-devel
USER hadoop

# Install SparkMagic and JupyterLab
RUN export PATH=$HOME/.local/bin:$HOME/livy/bin/:$PATH && \
    printf "numpy<2\nIPython<=7.14.0\n" > /tmp/constraint.txt && \
    pip3.11 --no-cache-dir install --constraint /tmp/constraint.txt --user pytest boto==2.49.0 jupyterlab==3.6.8 IPython==7.14.0 ipykernel==5.5.6 ipywidgets==7.7.2 sparkmagic==0.21.0 jupyterlab_widgets==1.1.11 && \
    jupyter-kernelspec install --user $(pip3.11 --no-cache-dir show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/sparkkernel && \
    jupyter-kernelspec install --user $(pip3.11 --no-cache-dir show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel && \
    jupyter server extension enable --user --py sparkmagic

# Create the entrypoint script that starts Livy and JupyterLab
RUN cat <<EOF >> /home/hadoop/.local/bin/entrypoint.sh
#!/usr/bin/env bash
mkdir -p /home/hadoop/workspace/
livy-server start
sleep 5
jupyter lab --no-browser --ip=0.0.0.0 --allow-root --ServerApp.root_dir=/home/hadoop/workspace/ --ServerApp.token='' --ServerApp.password=''
EOF

# Set up the entrypoint script
RUN chmod +x /home/hadoop/.local/bin/entrypoint.sh

# Add default SparkMagic config
ADD --chown=hadoop:hadoop https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/refs/heads/master/sparkmagic/example_config.json .sparkmagic/config.json

# Update the PATH variable
ENV PATH=/home/hadoop/.local/bin:/home/hadoop/livy/bin/:$PATH

ENTRYPOINT ["/home/hadoop/.local/bin/entrypoint.sh"]

Run the docker build command to build the image:

docker build \
    -t glue_v5_livy \
    --file $WORKSPACE_LOCATION/jupyterlab/Dockerfile.livy_jupyter \
    $WORKSPACE_LOCATION/jupyterlab/

When the image build is complete, you can use the following docker run command to start the newly built image:

docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -p 8998:8998 \
    -p 8888:8888 \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jupyter \
    glue_v5_livy
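
After the container starts, JupyterLab listens on port 8888 and the Livy REST API on port 8998 of your local machine, matching the -p flags above. As a rough sketch, you can also drive Livy directly from Python on your host; this assumes the requests package is installed locally:

import time
import requests

LIVY_URL = "http://localhost:8998"

# Create an interactive PySpark session through Livy's REST API
session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"

# Wait until the session is idle, then submit a statement and print its status
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)
statement = requests.post(f"{session_url}/statements", json={"code": "print(spark.version)"}).json()
print(requests.get(f"{session_url}/statements/{statement['id']}").json())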

Appendix D: Add additional Python libraries

In this section, we discuss adding additional Python libraries and installing Python packages using pip.

Python native libraries

To add local Python libraries, place them under a directory and assign the path to $EXTRA_PYTHON_PACKAGE_LOCATION:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $EXTRA_PYTHON_PACKAGE_LOCATION:/home/hadoop/workspace/extra_python_path/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pylib \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'export PYTHONPATH=/home/hadoop/workspace/extra_python_path/:$PYTHONPATH; pyspark'

To validate that the path has been added to PYTHONPATH, you can check for its existence in sys.path:

Python 3.11.6 (main, Jan  9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.2-amzn-1
      /_/

Using Python version 3.11.6 (main, Jan  9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740719582296).
SparkSession available as 'spark'.
>>> import sys
>>> "/home/hadoop/workspace/extra_python_path" in sys.path
True
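
For example, if the mounted directory contains a module named my_utils.py (a hypothetical file name used here only for illustration), you can import it directly in the REPL or in your job script:

>>> import my_utils  # my_utils.py lives under $EXTRA_PYTHON_PACKAGE_LOCATION
>>> my_utils.__file__
'/home/hadoop/workspace/extra_python_path/my_utils.py'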

Install Python packages with pip

To install PyPI packages (or packages from any other artifact repository) using pip, you can use the following approach:

docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    -e SCRIPT_FILE_NAME=$SCRIPT_FILE_NAME \
    --name glue5_pylib \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'pip3 install snowflake==1.0.5; spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME'


About the authors

Subramanya Vajiraya is a Sr. Cloud Engineer (ETL) at AWS Sydney specializing in AWS Glue. He is passionate about helping customers solve issues related to their ETL workloads and implementing scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and long walks with his dog Ollie.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team, based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
