Big Data

AWS Athena Query Best Practices and Cost Savings

Strategies we can implement:

  • Optimise Datasets
  • File organising and management
  • Query Tuning

How AWS Athena Operates

Parallel queries

Athena runs queries on a distributed query engine. Certain SQL operations, such as DDL statements and ORDER BY, run on a single node, while other queries run in parallel across many nodes. Athena scales out (not up) with the same hardware configuration to handle more complex queries.

Metadata-driven read optimization

Columnar storage formats like ORC and Parquet are designed for distributed analytics workloads. They store metadata that Athena can use intelligently at query time to skip over data that is not needed. To get the most out of this mechanism, organise the data cleverly (e.g. by sorting on key values in your ETL jobs).

Pricing Model

[https://aws.amazon.com/athena/pricing/]

Athena is serverless and operates on a per-query billing model, with two main factors that affect the cost:

  • Data Scanned
  • Query time

Athena Performance Limitations

  • Athena is a distributed query engine that uses S3 as its underlying storage layer. Unlike full database products, it does not have its own optimised storage engine, so its performance depends strongly on how the data is organised in S3.
  • Athena also has no indexes; it relies on fast full table scans. Adding LIMIT and/or WHERE clauses on partitioned data ensures that queries scan only what is needed, resulting in massive cost savings.

Performance Tuning Tips

Creating optimised datasets

Understand your Data

Exploring and understanding your dataset assists you in making smart decisions at every step, whether it’s loading or querying data.

Utilising Athena’s EXPLAIN and query plan features can assist you.

Eg.

EXPLAIN
SELECT COUNT(*)
FROM metrics
WHERE datekey = '20240312'

Query Plan
Trino version: 0.215-20718-g52025c0
Fragment 0 [SINGLE]
    Output layout: [count]
    Output partitioning: SINGLE []
    Output[columnNames = [_col0]]
    │   Layout: [count:bigint]
    │   Estimates: {rows: 1 (9B), cpu: 0, memory: 0B, network: 0B}
    │   _col0 := count
    └─ Aggregate[type = FINAL]
       │   Layout: [count:bigint]
       │   Estimates: {rows: 1 (9B), cpu: 9, memory: 9B, network: 0B}
       │   count := count("count_0")
       └─ LocalExchange[partitioning = SINGLE]
          │   Layout: [count_0:bigint]
          │   Estimates: {rows: 1 (9B), cpu: 0, memory: 0B, network: 0B}
          └─ RemoteSource[sourceFragmentIds = [1]]
                 Layout: [count_0:bigint]

Fragment 1 [SOURCE]
    Output layout: [count_0]
    Output partitioning: SINGLE []
    Aggregate[type = PARTIAL]
    │   Layout: [count_0:bigint]
    │   Estimates: {rows: 1 (9B), cpu: 0, memory: 9B, network: 0B}
    │   count_0 := count(*)
    └─ TableScan[table = awsdatacatalog$iceberg-aws:processed.metrics$data@616734414216088081 constraint on [datekey]]
           Layout: []
           Estimates: {rows: ? (0B), cpu: 0, memory: 0B, network: 0B}
           6:datekey:varchar
               :: [[20240312]]

Sorting

Presorting data during the ETL process offers several benefits:

  • It leverages the ETL job’s compute resources for sorting (Athena’s ORDER BY is a single-node operation)
  • It enhances the efficiency of data loading
  • Organising data before loading improves the file layout in storage
  • It improves the usefulness of file metadata

This streamlines ETL and results in improved query performance, offering time and cost savings alongside enhanced analytical capabilities.
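As an illustration, a hypothetical ETL step could presort data while writing it into the target table; the table and column names below are assumptions, not from the source. Sorting on the columns most often used in filters improves the min/max metadata that Parquet and ORC readers use to skip sections of files.

```sql
-- Hypothetical ETL step: write presorted data into the target table,
-- using the ETL query's compute rather than a later single-node ORDER BY.
INSERT INTO processed.metrics
SELECT
    datekey,
    customer_id,
    metric_name,
    metric_value
FROM staging.raw_metrics
ORDER BY datekey, customer_id;
```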

File organising and management

Partitioning data

Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, and region. Partitions act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query.

When deciding which columns to use as your partition keys, consider the following:

  • Pick keys that support your queries, based on how the data is queried and filtered.
  • Partition keys should have relatively low cardinality.
  • Watch out for data that is heavily skewed towards one partition value.
  • Partition by date when a dataset is set to grow over time (query costs will then remain constant).

Eg.
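A minimal sketch of a partitioned table definition; the table, columns, and S3 location are hypothetical:

```sql
-- Hypothetical partitioned table: datekey acts as a virtual column.
CREATE EXTERNAL TABLE metrics (
    customer_id   string,
    metric_name   string,
    metric_value  double
)
PARTITIONED BY (datekey string)
STORED AS PARQUET
LOCATION 's3://example-bucket/processed/metrics/';

-- A query filtering on the partition key scans only matching partitions:
SELECT COUNT(*) FROM metrics WHERE datekey = '20240312';
```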

Compress and split files

Compressing your data can speed up your queries significantly. Smaller data sizes mean less data scanned.

Athena supports a variety of compression formats, including gzip, Snappy, and zstd. For the whole list of supported formats, see Athena compression support.

Querying compressed text data, such as JSON or CSV, requires special consideration. When Athena reads data, it assigns different ranges of files to different nodes, to maximise parallel processing of data.

Compression types:

  • When creating datasets consisting of compressed text files, aim for a balance between the number of files and their sizes.
  • Parquet and ORC files are always splittable, because these formats compress sections of the files separately and store metadata with the locations of the different sections within the files.
  • The gzip format provides good compression ratios and has wide support across other tools and services, but gzip files are not splittable, so each file must be processed by a single node.
  • The zstd (Zstandard) format is a newer compression format with a good balance between performance and compression ratio.
  • The bzip2 and LZO compression formats are splittable, but they are not recommended when performance and compatibility matter.

Eg.
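As an illustration, a CTAS statement can convert text data into compressed Parquet in one step; the table names and location below are hypothetical:

```sql
-- Hypothetical CTAS: convert raw text data into Snappy-compressed Parquet.
CREATE TABLE processed.metrics_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://example-bucket/processed/metrics_parquet/'
)
AS SELECT * FROM raw.metrics_text;
```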

Optimise file sizes

Queries run more efficiently when data can be read in parallel and as much of the data as possible is read in a single read request. To maximise the performance of queries, you should aim for a balance between the number of files and their size.

A general guideline recommended from Amazon documentation is to aim for splits that are around 128 MB in size.

Another issue you may face is over-partitioning, which results in too many small files and in turn reduces the performance of your queries. In the worst cases, your queries may fail with an Amazon S3 error saying “Please reduce your request rate.”

One remedy for the small file problem is to use the S3DistCp utility on Amazon EMR, or compaction (OPTIMIZE) on Apache Iceberg tables, to combine smaller files into larger objects.
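For Iceberg tables managed through Athena, the compaction could look like this sketch (the table name and partition value are hypothetical):

```sql
-- Hypothetical Iceberg compaction: rewrite small files into larger ones.
OPTIMIZE processed.metrics REWRITE DATA USING BIN_PACK
WHERE datekey = '20240312';

-- Optionally clean up expired snapshots and orphan files afterwards.
VACUUM processed.metrics;
```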

Some benefits of having fewer, larger files include faster listing, fewer Amazon S3 requests, and less metadata to manage. Reading the same amount of data in Athena from one file vs. 5,000 files reduced run time by 72%.

Query tuning

The Athena SQL engine is built on the open source distributed query engines Trino and Presto. Understanding how it works provides insight into how you can optimise your queries when running them.

“SELECT” query

Select only the columns you need; when the full dataset is not required, the query benefits from a partial scan, especially with columnar formats.

Eg.
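A sketch over a hypothetical metrics table:

```sql
-- Scans every column:
SELECT * FROM metrics;

-- Scans only the columns needed (much cheaper with Parquet/ORC):
SELECT datekey, metric_value FROM metrics;
```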

“ORDER BY” query

The ORDER BY operation is performed on a single node, which makes it slow and memory-intensive. Ideally it should be avoided; however, if your use case requires it, it is recommended to add a LIMIT to improve query time.

For example, compare a SELECT with ORDER BY on the date key, with and without a LIMIT:
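A hedged sketch over the same hypothetical metrics table; the LIMIT lets the single sorting node keep only the top rows instead of the full result set:

```sql
-- Without a limit: the entire result set is sorted on one node.
SELECT datekey, metric_value FROM metrics ORDER BY datekey;

-- With a limit: typically far faster, as only the top rows are retained.
SELECT datekey, metric_value FROM metrics ORDER BY datekey LIMIT 100;
```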

“Joins” query

It is also critical to choose the correct join order to ensure fast and effective performance from your queries. When joining two tables of different sizes, place the larger table on the left side of the join and the smaller table on the right side.

For the most common type of join, which uses equality conditions, Athena builds a lookup table from the table on the right and distributes it to the worker nodes. It then streams the table on the left, joining rows by looking up matching values in the lookup table. An important thing to note is that the lookup table built from the right-hand table is kept in memory: the smaller that table is, the less memory is used and the faster the join runs.

A common mistake is performing a join that selects all the data instead of only the specific columns needed to produce the required result.

Eg.
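A sketch with hypothetical tables, where metrics is the large table and customers the small lookup table:

```sql
-- Larger table on the left, smaller lookup table on the right,
-- selecting only the columns the result actually needs.
SELECT m.datekey, m.metric_value, c.customer_name
FROM metrics AS m
JOIN customers AS c
  ON m.customer_id = c.customer_id;
```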

When querying partitioned tables, it’s best to include filters on all partition keys, for all tables. This ensures that the query planner can skip as much as possible of listing and reading files.

For example, compare two queries where the JOIN is performed without and with filters on its partition key, which in this case is the datekey for both tables:

Eg.
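A sketch assuming two hypothetical tables that are both partitioned by datekey:

```sql
-- Without partition filters: both tables are scanned in full.
SELECT m.metric_value, d.event_count
FROM metrics AS m
JOIN daily_events AS d
  ON m.datekey = d.datekey;

-- With filters on both partition keys: the query planner prunes
-- partitions on both sides and scans far less data.
SELECT m.metric_value, d.event_count
FROM metrics AS m
JOIN daily_events AS d
  ON m.datekey = d.datekey
WHERE m.datekey = '20240312'
  AND d.datekey = '20240312';
```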

“GROUP BY” query

GROUP BY columns should be arranged by cardinality, from highest to lowest, to benefit from improved query times.

Eg.
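A sketch assuming that, in the hypothetical metrics table, customer_id has far more distinct values than datekey:

```sql
-- The higher-cardinality column (customer_id) is listed first.
SELECT customer_id, datekey, SUM(metric_value) AS total
FROM metrics
GROUP BY customer_id, datekey;
```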

“LIKE” query

When matching against several patterns, Athena benefits from a single regular expression via regexp_like() rather than chaining multiple LIKE clauses.

Eg.
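A sketch with a hypothetical metric_name column:

```sql
-- Multiple LIKE clauses chained with OR:
SELECT COUNT(*) FROM metrics
WHERE metric_name LIKE '%cpu%' OR metric_name LIKE '%memory%';

-- A single regexp_like() expression is usually faster:
SELECT COUNT(*) FROM metrics
WHERE regexp_like(metric_name, 'cpu|memory');
```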

Conclusion

In conclusion, each of these optimisations can yield significant improvements in both query performance and cost savings, with results varying between users. However, it is important to recognise that applying them all together, synergistically, maximises efficiency in data analysis at scale.

With data growing indefinitely and becoming more complex every day, querying that data can be a daunting task, often leading to inefficiencies, high latency, and inflated costs. In our pursuit to utilise the power of data without breaking the bank we encounter several challenges, but by understanding how AWS Athena operates we can implement optimisations and best practices to overcome these obstacles and achieve cost-effective data analysis.

Zubair Samsodien is a certified Big Data and Analytics engineer. Zubair has been working on AWS Big Data & analytics since 2020.