Strategies we can implement:
Athena runs queries on a distributed query engine. Certain SQL operations, such as DDL statements and ORDER BY, run on a single node, while other operations run in parallel. Athena scales out (not up) with the same hardware configuration to handle more complex queries.
Columnar storage formats like ORC and Parquet are designed for distributed analytics workloads and carry metadata that Athena can use intelligently at query time to skip over data that is not needed. To get the most out of this mechanism, organise your data deliberately (e.g. by sorting on relevant values in your ETL jobs).
[https://aws.amazon.com/athena/pricing/]
Athena is serverless and operates on per-query billing; the main factor that affects cost is the amount of data each query scans.
Exploring and understanding your dataset assists you in making smart decisions at every step, whether it’s loading or querying data.
Utilising Athena’s EXPLAIN feature to inspect the query plan can assist you.
Eg.
EXPLAIN
SELECT COUNT(*)
FROM metrics
WHERE datekey = '20240312'
Query Plan
Trino version: 0.215-20718-g52025c0
Fragment 0 [SINGLE]
Output layout: [count]
Output partitioning: SINGLE []
Output[columnNames = [_col0]]
│ Layout: [count:bigint]
│ Estimates: {rows: 1 (9B), cpu: 0, memory: 0B, network: 0B}
│ _col0 := count
└─ Aggregate[type = FINAL]
│ Layout: [count:bigint]
│ Estimates: {rows: 1 (9B), cpu: 9, memory: 9B, network: 0B}
│ count := count("count_0")
└─ LocalExchange[partitioning = SINGLE]
│ Layout: [count_0:bigint]
│ Estimates: {rows: 1 (9B), cpu: 0, memory: 0B, network: 0B}
└─ RemoteSource[sourceFragmentIds = [1]]
Layout: [count_0:bigint]
Fragment 1 [SOURCE]
Output layout: [count_0]
Output partitioning: SINGLE []
Aggregate[type = PARTIAL]
│ Layout: [count_0:bigint]
│ Estimates: {rows: 1 (9B), cpu: 0, memory: 9B, network: 0B}
│ count_0 := count(*)
└─ TableScan[table = awsdatacatalog$iceberg-aws:processed.metrics$data@616734414216088081 constraint on [datekey]]
Layout: []
Estimates: {rows: ? (0B), cpu: 0, memory: 0B, network: 0B}
6:datekey:varchar
:: [[20240312]]
Presorting data during the ETL process offers several benefits: it streamlines ETL and improves query performance, saving time and cost while enhancing analytical capability.
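As a minimal sketch of presorting during ETL (the raw.metrics_raw source table and S3 path are hypothetical), a CTAS statement can write sorted Parquet:

CREATE TABLE processed.metrics_sorted
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/processed/metrics_sorted/'  -- hypothetical location
) AS
SELECT *
FROM raw.metrics_raw
ORDER BY datekey;  -- clustering rows by datekey lets Parquet row-group min/max statistics skip data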
Partitioning divides your table into parts and keeps related data together based on column values such as date, country, or region. Partition columns act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query.
When deciding which columns to use as your partition key, consider the following:
Eg.
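For illustration (the metric columns and S3 location are hypothetical; datekey follows the query plan shown earlier), a table partitioned by datekey could be defined and queried as follows:

CREATE EXTERNAL TABLE metrics (
    metric_name string,
    metric_value double
)
PARTITIONED BY (datekey string)  -- datekey becomes a virtual column
STORED AS PARQUET
LOCATION 's3://my-bucket/metrics/';

-- Only files under the datekey=20240312 prefix are listed and read
SELECT count(*)
FROM metrics
WHERE datekey = '20240312';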
Compressing your data can speed up your queries significantly, as smaller data sizes mean less data scanned.
Athena supports a variety of compression formats, including gzip, Snappy, and zstd. For the whole list of supported formats, see Athena compression support.
Querying compressed text data, such as JSON or CSV, requires special consideration. When Athena reads data, it assigns different ranges of files to different nodes to maximise parallel processing. This only works when the compression format is splittable: a file compressed with a non-splittable format such as gzip can only be processed by a single node, while splittable formats (such as bzip2) preserve parallelism.
Eg.
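One hedged sketch of handling this (the table names are hypothetical) is to convert gzip-compressed CSV into a splittable, compressed columnar format with CTAS:

CREATE TABLE processed.events_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY'  -- compressed but still readable in parallel
) AS
SELECT *
FROM raw.events_csv_gzip;  -- hypothetical table backed by gzipped CSV files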
Queries run more efficiently when data can be read in parallel and as much of the data as possible is read in a single read request. To maximise the performance of queries, you should aim for a balance between the number of files and their size.
A general guideline from the Amazon documentation is to aim for splits that are around 128 MB in size.
Another issue you may face is over-partitioning, which results in too many small files and in turn reduces query performance. In the worst cases, your queries may fail with an Amazon S3 error saying “Please reduce your request rate.”
One remedy for the small-file problem is the S3DistCp utility on Amazon EMR, or compaction with the OPTIMIZE statement on Apache Iceberg tables; both combine smaller files into larger objects.
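On an Iceberg table, compaction might look like the following sketch (the predicate is illustrative; the metrics table follows the plan output above):

-- Rewrites small data files into larger, bin-packed files
OPTIMIZE processed.metrics REWRITE DATA USING BIN_PACK
WHERE datekey = '20240312';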
Some benefits of having fewer, larger files include faster listing, fewer Amazon S3 requests, and less metadata to manage. Reading the same amount of data in Athena from one file vs. 5,000 files reduced run time by 72%.
The Athena SQL engine is built on the open source distributed query engines Trino and Presto. Understanding how it works provides insight into how you can optimise the queries you run.
Select only what is needed: if the full dataset is not required, you benefit from a partial scan.
Eg.
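As a simple sketch (metric_name and metric_value are hypothetical columns):

-- Reads every column in the table
SELECT * FROM metrics;

-- Reads only the two columns it needs; with columnar formats the rest is never scanned
SELECT metric_name, metric_value FROM metrics;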
ORDER BY is executed on a single node, because producing a globally sorted result cannot be fully parallelised, which makes it slow and memory-intensive. Ideally it should be avoided; if your use case requires it, adding a LIMIT is recommended to improve query time.
For example, the following contains a SELECT with ORDER BY on the date key, with and without a LIMIT:
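A sketch of the two variants (the selected columns are hypothetical):

-- The full sorted result must be produced on a single node
SELECT metric_name, metric_value
FROM metrics
ORDER BY datekey DESC;

-- With a LIMIT, workers only keep their top rows, shrinking the final sort
SELECT metric_name, metric_value
FROM metrics
ORDER BY datekey DESC
LIMIT 100;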
It is also critical to choose the correct join order to ensure fast, effective queries. When joining two tables of different sizes, place the larger table on the left side of the join and the smaller table on the right.
For the most common type of join, which uses equality conditions, Athena builds a lookup table from the table on the right and distributes it to the worker nodes. It then streams the table on the left, joining rows by looking up matching values in the lookup table. Importantly, the lookup table built from the right-hand table is kept in memory: the smaller that table is, the less memory is used and the faster the join runs.
Consider, for example, a join where all columns are selected instead of only the specific columns the query actually needs.
Eg.
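A hedged sketch (the dimensions table and join key are hypothetical):

SELECT m.metric_name, m.metric_value, d.region  -- only the columns the query needs
FROM metrics m          -- larger table on the left, streamed
JOIN dimensions d       -- smaller table on the right, built into the in-memory lookup table
    ON m.dimension_id = d.dimension_id;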
When querying partitioned tables, it’s best to include filters on all partition keys, for all tables involved. This ensures that the query planner can skip as much file listing and reading as possible.
For example, consider two queries where the JOIN is performed without and then with a filter on the partition key, which in this case is the datekey of both tables:
Eg.
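A sketch of both variants (the events table and join key are hypothetical; both tables are assumed to be partitioned on datekey):

-- Without partition filters: every partition of both tables is listed and read
SELECT m.metric_name, e.event_type
FROM metrics m
JOIN events e ON m.event_id = e.event_id;

-- With filters on the partition key of both tables: one partition each is scanned
SELECT m.metric_name, e.event_type
FROM metrics m
JOIN events e ON m.event_id = e.event_id
WHERE m.datekey = '20240312'
  AND e.datekey = '20240312';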
GROUP BY columns should be ordered from highest to lowest cardinality to benefit from improved query times.
Eg.
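A minimal sketch (assuming a hypothetical user_id column with far higher cardinality than country):

-- Highest-cardinality column first
SELECT user_id, country, count(*)
FROM metrics
GROUP BY user_id, country;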
Athena benefits from pattern-matching optimisations: a single regexp_like() expression often outperforms a chain of LIKE clauses.
Eg.
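A sketch of the rewrite (metric_name and the patterns are hypothetical):

-- Several LIKE clauses, each evaluated separately
SELECT count(*)
FROM metrics
WHERE metric_name LIKE '%cpu%'
   OR metric_name LIKE '%memory%'
   OR metric_name LIKE '%disk%';

-- A single regexp_like() evaluated once per row
SELECT count(*)
FROM metrics
WHERE regexp_like(metric_name, 'cpu|memory|disk');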
In conclusion, each of these optimisations can yield significant improvements in query performance and cost, with results varying between workloads. It is important to recognise, however, that they are most effective when applied together, maximising the efficiency of data analysis at scale.
With data growing indefinitely and becoming more complex by the day, querying it can be a daunting task, often leading to inefficiency, high latency, and inflated costs. In our pursuit of the power of data without breaking the bank we encounter several challenges, but by understanding how AWS Athena operates we can apply the optimisations and best practices above to overcome these obstacles and achieve cost-effective data analysis.
Zubair Samsodien is a certified Big Data and Analytics engineer and has been working on AWS Big Data & Analytics since 2020.