Optimizing AWS Glue Costs: How to Reduce Expenses for Interactive Sessions?


I've been using an AWS Glue interactive Jupyter Notebook to write a script. This script reads JSON data from an S3 bucket, transforms the data types, and writes the output as a Parquet file back to S3.

After approximately 6-7 hours of testing over the last 2 days, I was surprised to find that my AWS Glue usage cost me about $160. The billing detail shows a rate of $0.44 per Data Processing Unit-Hour for Glue interactive sessions and job notebooks, and a total of 365.766 DPU-Hours.

I'm wondering what might be causing such high costs with this script. Any advice on configuration parameters to minimize future costs would be greatly appreciated.

%idle_timeout 100
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)


# Create a DynamicFrame from JSON data in S3; no Glue catalog or crawler is required for this method
connection_options = {
    "paths": ["s3://<s3-bucket-source>/"],
    "recurse": True
}

dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options=connection_options,
    format = "json"
)
# dynamic_frame.toDF().show()
dynamic_frame.toDF().printSchema()


json_dynamic_frame=dynamic_frame.resolveChoice(specs = [
    ('data.sensor.cpuCurrentLoad','cast:double'), 
    ('data.sensor.cpuLoadAverage_15','cast:double'), 
    ('data.sensor.cpuCurrentLoadSystem','cast:double'), 
    ('data.sensor.cpuCurrentLoadUser','cast:double'), 
    ('data.sensor.cpuLoadAverage_5','cast:double'),
    ('data.sensor.cpuCurrentLoadIdle','cast:double'),
    ('data.sensor.cpuLoadAverage_1','cast:double'),
    ('data.calculation.memoryFree_MB','cast:double'),
])
json_dynamic_frame.printSchema()

glueContext.write_dynamic_frame.from_options(
    frame=json_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://<S3-bucket-destination>/", "compression": "NONE"},
    format="parquet"
)
1 Answer
Accepted Answer

1. Reduce Idle Timeout

The %idle_timeout setting in your script is currently set to 100 minutes. An interactive session keeps its provisioned workers, and keeps accruing DPU-hours, until you stop it or it reaches the idle timeout, so a long timeout means you can pay for up to 100 minutes of idle capacity after each round of work. Setting it to 10 or 20 minutes is usually more cost-effective during development.

%idle_timeout 20
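
Note that session configuration magics such as %idle_timeout, %worker_type, and %number_of_workers only take effect when they run before the first cell that starts the session, and you can also end a session explicitly instead of letting it sit until the timeout. For example, in the notebook:

# Check the configuration and state of the current session
%status

# End the session immediately once you are done testing
%stop_session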

2. Optimize Worker Type and Number of Workers

The choice of worker type and the number of workers significantly impact your costs. In your script you are using the G.1X worker type with 2 workers. Consider the following adjustments:

  • Worker Type: If your job does not need the memory and CPU of a G.1X worker, a smaller or cheaper worker type can lower the hourly DPU rate. Note that Standard is a legacy worker type and is not available on every Glue version, so confirm that your session's Glue version supports it, and ensure the change does not adversely affect your job's performance.
  • Number of Workers: Start with 1 worker and scale up only if needed. This reduces the initial cost and helps you better understand the actual resource requirements of your job.
%worker_type Standard
%number_of_workers 1

3. Use AWS Glue Job Bookmarks

AWS Glue job bookmarks keep track of the data that has already been processed. This prevents reprocessing the same data on every run and saves costs, which is especially useful when you are testing and re-running your script multiple times.
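
Bookmarks apply when the script runs as a Glue job with the job argument --job-bookmark-option set to job-bookmark-enable, and they track progress through the transformation_ctx you pass on reads and writes. Here is a minimal sketch; the context name source_json is just an example:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # starts bookmark tracking for this run

# transformation_ctx is the key the bookmark uses to remember what was already read
dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<s3-bucket-source>/"], "recurse": True},
    format="json",
    transformation_ctx="source_json"
)

job.commit()  # records the bookmark state after a successful run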

4. Efficient Data Handling

  • Read Data in Batches: If you have a large dataset, read the data in batches instead of loading it all at once. This can help manage memory usage and reduce the overall processing time.
  • Filter Data Early: Apply any filters or transformations early in the process to minimize the amount of data being processed (see the sketch after this list).
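
Here is a hedged sketch of both ideas; the date-style prefix and the cpuCurrentLoad predicate are only assumptions about your data, so adapt them to your actual layout:

from awsglue.transforms import Filter

# Read a narrower S3 prefix instead of the whole bucket (assumes the objects are
# organized by date; adjust the path to your real layout).
connection_options = {
    "paths": ["s3://<s3-bucket-source>/2024/06/"],
    "recurse": True
}

dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options=connection_options,
    format="json"
)

# Drop records you do not need right after the read so later steps touch less data.
# The predicate is only an example; each record is exposed as a nested dict.
filtered_frame = Filter.apply(
    frame=dynamic_frame,
    f=lambda rec: rec["data"]["sensor"]["cpuCurrentLoad"] is not None
)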

5. Optimize Data Transformations

Keep your data transformations efficient. Avoid unnecessary actions such as repeated show() or count() calls, since each one triggers another Spark job, and avoid converting back and forth between DynamicFrames and DataFrames more often than needed.
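
For example, while testing interactively it is cheaper to convert the frame once and inspect a small sample of rows than to repeatedly call show() on the full dataset (the column names below are taken from the schema above):

# Convert once and reuse the DataFrame instead of calling toDF() in several cells.
df = dynamic_frame.toDF()
df.printSchema()

# Look at a handful of rows and columns while testing instead of materializing everything.
df.select("data.sensor.cpuCurrentLoad", "data.calculation.memoryFree_MB").show(10)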

6. Optimize Output Format and Compression

  • Output Format: Parquet is already a good choice for optimized storage and faster read/write operations.
  • Compression: Use an efficient compression algorithm like SNAPPY to reduce the storage cost and improve performance.

Here is an adjusted version of your script with some of the recommendations applied:

%idle_timeout 20
%glue_version 4.0
%worker_type Standard
%number_of_workers 1

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Create a DynamicFrame from JSON data in S3; no Glue catalog or crawler is required for this method
connection_options = {
    "paths": ["s3://<s3-bucket-source>/"],
    "recurse": True
}

dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options=connection_options,
    format="json"
)

# Resolve choice to cast data types
json_dynamic_frame = dynamic_frame.resolveChoice(specs = [
    ('data.sensor.cpuCurrentLoad', 'cast:double'), 
    ('data.sensor.cpuLoadAverage_15', 'cast:double'), 
    ('data.sensor.cpuCurrentLoadSystem', 'cast:double'), 
    ('data.sensor.cpuCurrentLoadUser', 'cast:double'), 
    ('data.sensor.cpuLoadAverage_5', 'cast:double'),
    ('data.sensor.cpuCurrentLoadIdle', 'cast:double'),
    ('data.sensor.cpuLoadAverage_1', 'cast:double'),
    ('data.calculation.memoryFree_MB', 'cast:double'),
])

glueContext.write_dynamic_frame.from_options(
    frame=json_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://<S3-bucket-destination>/", "compression": "SNAPPY"},
    format="parquet"
)

By following these recommendations, you should be able to significantly reduce the costs associated with your AWS Glue interactive sessions while maintaining the efficiency of your data processing tasks.

  • Thanks for the answer @Oleksii Bebych.

    QQ:

    • Do you think it is necessary to write job.commit() at the end of the program?

    • I was using a Jupyter interactive session while trying and testing. Do you think I should use the script editor instead of the Jupyter interactive session?

  • In AWS Glue, the job.commit() function is typically used in the context of AWS Glue Jobs to signal the successful completion of the job and to perform any necessary cleanup actions. If you are running your script in a Jupyter Notebook within an AWS Glue interactive session, the use of job.commit() is not strictly necessary unless you are leveraging specific Glue job capabilities such as bookmarking or monitoring that require a job lifecycle to be tracked.

  • Development Phase: Continue using Jupyter interactive sessions for developing and testing your ETL scripts. This environment is more flexible and interactive, allowing you to quickly test and validate your logic.

  • Production Phase: Once your script is thoroughly tested and ready for production, move it to the AWS Glue script editor and run it as a standard Glue job. This approach ensures better cost management and leverages AWS Glue's job management capabilities (a minimal sketch of registering such a job follows below).
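
If it helps, here is a minimal sketch of registering the tested script as a regular Glue job with boto3 once it is production-ready; the job name, IAM role, and script location are placeholders to replace with your own:

import boto3

glue = boto3.client("glue")

# All names, ARNs, and S3 paths below are placeholders.
glue.create_job(
    Name="json-to-parquet-job",
    Role="arn:aws:iam::<account-id>:role/<glue-job-role>",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<scripts-bucket>/json_to_parquet.py",
        "PythonVersion": "3"
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"}
)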