Optimizing AWS Glue Costs: How to Reduce Expenses for Interactive Sessions?


I've been using an AWS Glue interactive Jupyter Notebook to write a script. This script reads JSON data from an S3 bucket, transforms the data types, and writes the output as a Parquet file back to S3.

After approximately 6-7 hours of testing over the last 2 days, I was surprised to find that my AWS Glue usage cost me about $160. The billing detail shows a rate of $0.44 per Data Processing Unit-Hour for Glue interactive sessions and job notebooks, and a total of 365.766 DPU-Hours.

I'm wondering what might be causing such high costs with this script. Any advice on configuration parameters to minimize future costs would be greatly appreciated.

%idle_timeout 100
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)


# Create a DynamicFrame from JSON data in S3; no Glue catalog or crawler is required for this method
connection_options = {
    "paths": ["s3://<s3-bucket-source>/"],
    "recurse": True
}

dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options=connection_options,
    format = "json"
)
# dynamic_frame.toDF().show()
dynamic_frame.toDF().printSchema()


json_dynamic_frame=dynamic_frame.resolveChoice(specs = [
    ('data.sensor.cpuCurrentLoad','cast:double'), 
    ('data.sensor.cpuLoadAverage_15','cast:double'), 
    ('data.sensor.cpuCurrentLoadSystem','cast:double'), 
    ('data.sensor.cpuCurrentLoadUser','cast:double'), 
    ('data.sensor.cpuLoadAverage_5','cast:double'),
    ('data.sensor.cpuCurrentLoadIdle','cast:double'),
    ('data.sensor.cpuLoadAverage_1','cast:double'),
    ('data.calculation.memoryFree_MB','cast:double'),
])
json_dynamic_frame.printSchema()

glueContext.write_dynamic_frame.from_options(
    frame=json_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://<S3-bucket-destination>/", "compression": "NONE"},
    format="parquet"
)
1 Answer
Accepted Answer

1. Reduce Idle Timeout

The %idle_timeout setting in your script is currently set to 100 minutes. An interactive session keeps its provisioned workers, and keeps accruing DPU-hours, until you stop it or it reaches the idle timeout, so a long timeout means you can pay for up to 100 minutes of idle capacity after each round of work. Setting it to 10 or 20 minutes is usually more cost-effective during development.

%idle_timeout 20
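
Note that session configuration magics such as %idle_timeout, %worker_type, and %number_of_workers only take effect when they run before the first cell that starts the session, and you can also end a session explicitly instead of letting it sit until the timeout. For example, in the notebook:

# Check the configuration and state of the current session
%status

# End the session immediately once you are done testing
%stop_session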

2. Optimize Worker Type and Number of Workers

The choice of worker type and the number of workers significantly impact your costs. In your script you are using the G.1X worker type with 2 workers. Consider the following adjustments:

  • Worker Type: If your job does not need the memory and CPU of a G.1X worker, a smaller or cheaper worker type can lower the hourly DPU rate. Note that Standard is a legacy worker type and is not available on every Glue version, so confirm that your session's Glue version supports it, and ensure the change does not adversely affect your job's performance.
  • Number of Workers: Start with 1 worker and scale up only if needed. This reduces the initial cost and helps you better understand the actual resource requirements of your job.
%worker_type Standard
%number_of_workers 1

3. Use AWS Glue Job Bookmarks

AWS Glue job bookmarks keep track of the data that has already been processed. This prevents reprocessing the same data on every run and saves costs, which is especially useful when you are testing and re-running your script multiple times.
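
Bookmarks apply when the script runs as a Glue job with the job argument --job-bookmark-option set to job-bookmark-enable, and they track progress through the transformation_ctx you pass on reads and writes. Here is a minimal sketch; the context name source_json is just an example:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # starts bookmark tracking for this run

# transformation_ctx is the key the bookmark uses to remember what was already read
dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<s3-bucket-source>/"], "recurse": True},
    format="json",
    transformation_ctx="source_json"
)

job.commit()  # records the bookmark state after a successful run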

4. Efficient Data Handling

  • Read Data in Batches: If you have a large dataset, read the data in batches instead of loading it all at once. This can help manage memory usage and reduce the overall processing time.
  • Filter Data Early: Apply any filters or transformations early in the process to minimize the amount of data being processed (see the sketch after this list).
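
Here is a hedged sketch of both ideas; the date-style prefix and the cpuCurrentLoad predicate are only assumptions about your data, so adapt them to your actual layout:

from awsglue.transforms import Filter

# Read a narrower S3 prefix instead of the whole bucket (assumes the objects are
# organized by date; adjust the path to your real layout).
connection_options = {
    "paths": ["s3://<s3-bucket-source>/2024/06/"],
    "recurse": True
}

dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options=connection_options,
    format="json"
)

# Drop records you do not need right after the read so later steps touch less data.
# The predicate is only an example; each record is exposed as a nested dict.
filtered_frame = Filter.apply(
    frame=dynamic_frame,
    f=lambda rec: rec["data"]["sensor"]["cpuCurrentLoad"] is not None
)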

5. Optimize Data Transformations

Keep your data transformations efficient. Avoid unnecessary actions such as repeated show() or count() calls, since each one triggers another Spark job, and avoid converting back and forth between DynamicFrames and DataFrames more often than needed.
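
For example, while testing interactively it is cheaper to convert the frame once and inspect a small sample of rows than to repeatedly call show() on the full dataset (the column names below are taken from the schema above):

# Convert once and reuse the DataFrame instead of calling toDF() in several cells.
df = dynamic_frame.toDF()
df.printSchema()

# Look at a handful of rows and columns while testing instead of materializing everything.
df.select("data.sensor.cpuCurrentLoad", "data.calculation.memoryFree_MB").show(10)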

6. Optimize Output Format and Compression

  • Output Format: Parquet is already a good choice for optimized storage and faster read/write operations.
  • Compression: Use an efficient compression algorithm like SNAPPY to reduce the storage cost and improve performance.

Here is an adjusted version of your script with some of the recommendations applied:

%idle_timeout 20
%glue_version 4.0
%worker_type Standard
%number_of_workers 1

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Create a DynamicFrame from JSON data in S3; no Glue catalog or crawler is required for this method
connection_options = {
    "paths": ["s3://<s3-bucket-source>/"],
    "recurse": True
}

dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options=connection_options,
    format="json"
)

# Resolve choice to cast data types
json_dynamic_frame = dynamic_frame.resolveChoice(specs = [
    ('data.sensor.cpuCurrentLoad', 'cast:double'), 
    ('data.sensor.cpuLoadAverage_15', 'cast:double'), 
    ('data.sensor.cpuCurrentLoadSystem', 'cast:double'), 
    ('data.sensor.cpuCurrentLoadUser', 'cast:double'), 
    ('data.sensor.cpuLoadAverage_5', 'cast:double'),
    ('data.sensor.cpuCurrentLoadIdle', 'cast:double'),
    ('data.sensor.cpuLoadAverage_1', 'cast:double'),
    ('data.calculation.memoryFree_MB', 'cast:double'),
])

glueContext.write_dynamic_frame.from_options(
    frame=json_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://<S3-bucket-destination>/", "compression": "SNAPPY"},
    format="parquet"
)

By following these recommendations, you should be able to significantly reduce the costs associated with your AWS Glue interactive sessions while maintaining the efficiency of your data processing tasks.

  • Thanks for the answer @Oleksii Bebych.

    QQ:

    • Do you think it is necessary to write job.commit() at the end of the program?

    • I was using a Jupyter interactive session while trying and testing. Do you think I should use the script editor instead of the Jupyter interactive session?

  • In AWS Glue, the job.commit() function is typically used in the context of AWS Glue Jobs to signal the successful completion of the job and to perform any necessary cleanup actions. If you are running your script in a Jupyter Notebook within an AWS Glue interactive session, the use of job.commit() is not strictly necessary unless you are leveraging specific Glue job capabilities such as bookmarking or monitoring that require a job lifecycle to be tracked.

  • Development Phase: Continue using Jupyter interactive sessions for developing and testing your ETL scripts. This environment is more flexible and interactive, allowing you to quickly test and validate your logic.

  • Production Phase: Once your script is thoroughly tested and ready for production, move it to the AWS Glue script editor and run it as a standard Glue job. This approach ensures better cost management and leverages AWS Glue's job management capabilities (a minimal sketch of registering such a job follows below).
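
If it helps, here is a minimal sketch of registering the tested script as a regular Glue job with boto3 once it is production-ready; the job name, IAM role, and script location are placeholders to replace with your own:

import boto3

glue = boto3.client("glue")

# All names, ARNs, and S3 paths below are placeholders.
glue.create_job(
    Name="json-to-parquet-job",
    Role="arn:aws:iam::<account-id>:role/<glue-job-role>",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<scripts-bucket>/json_to_parquet.py",
        "PythonVersion": "3"
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"}
)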