Big Data

Using Apache Iceberg with your data lake on AWS

We were facing the following challenges in a client’s data lake:

  • We needed ACID transactions for our tables.
  • We had lots of small files on S3, which slowed down performance.
  • Renaming a column or adding a column to a table required us to recreate the Parquet files with the new schema.
  • We could not implement versioning or time travel.

Using Apache Iceberg on AWS helped us solve these challenges quickly and easily.

Advantages of Apache Iceberg with AWS

  1. ACID Transactions: Apache Iceberg ensures data consistency through ACID transactions (Atomicity, Consistency, Isolation, Durability). This means your data updates are guaranteed to be complete and accurate, even in case of failures during writes. This is crucial for maintaining reliable data pipelines on AWS.
  2. Schema Evolution: Iceberg offers flexible schema evolution. You can add new columns, change compatible data types, or drop existing columns without impacting existing data or requiring complex table rewrites (see the example after this list). This simplifies data management as your data needs evolve.
  3. Time Travel Queries: With Iceberg, you can perform time-travel queries. These allow you to access historical versions of your data at any point in time, which is valuable for tasks like auditing changes, debugging data pipelines, or analysing trends over time.
  4. Performance Optimization: Iceberg’s OPTIMIZE statement helps improve the performance of Athena queries against your data lake on Amazon S3. By rewriting many small data files into fewer, larger ones, OPTIMIZE reduces the processing overhead for Athena, leading to faster queries.
  5. Integration with AWS Services: Apache Iceberg integrates seamlessly with various AWS services like Athena, AWS Glue and Amazon EMR (Elastic MapReduce). This simplifies data ingestion, processing, and querying within your AWS environment. You can leverage existing AWS tools and workflows to manage your Iceberg tables.
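
As a quick illustration of point 2, adding a column to an Iceberg table in Athena is a metadata-only change that does not rewrite existing data files. A minimal sketch, reusing the sample table created in the next section (the age column is purely illustrative):

ALTER TABLE your_db.your_table ADD COLUMNS (age int)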

Creating an Apache Iceberg table

Defining a table as an Apache Iceberg table is as simple as setting the table type to 'iceberg'.

CREATE TABLE your_db.your_table (
  name string,
  surname string
)
LOCATION 's3://s3bucket/s3folder/'
TBLPROPERTIES (
  'table_type'='iceberg'
);
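
For larger tables you will usually also want to define partitions at creation time. A minimal sketch, assuming a hypothetical created_at timestamp column, a daily Iceberg partition transform and an illustrative S3 location:

CREATE TABLE your_db.your_partitioned_table (
  name string,
  surname string,
  created_at timestamp
)
PARTITIONED BY (day(created_at))
LOCATION 's3://s3bucket/another_s3folder/'
TBLPROPERTIES (
  'table_type'='iceberg',
  'format'='parquet'
);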

Using your Apache Iceberg table

Once your table is created, you can insert, update and delete records as follows.

INSERT INTO your_db.your_table (name, surname) VALUES ('John', 'Doe');

UPDATE your_db.your_table SET name = 'Johny' WHERE name = 'John' AND surname = 'Doe';

DELETE FROM your_db.your_table WHERE name = 'John' AND surname = 'Doe';

To time travel to a previous point in time:

SELECT * FROM your_db.your_table FOR TIMESTAMP AS OF (current_timestamp - interval '1' day)
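
You can also pass an explicit timestamp literal instead of a relative expression; the date below is purely illustrative:

SELECT * FROM your_db.your_table FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC'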

In Apache Iceberg, every change to the data in a table creates a new version called a snapshot.
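
One way to find the snapshot IDs of a table (assuming an Athena engine version that supports Iceberg metadata queries) is to query its $history metadata table:

SELECT snapshot_id, made_current_at FROM "your_db"."your_table$history" ORDER BY made_current_at DESC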

To execute a version travel query (that is, to view a consistent snapshot as of a specified version), pass the snapshot ID as the version:

SELECT * FROM your_db.your_table FOR VERSION AS OF version

Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data.

DROP TABLE your_db.your_table -- this will delete your data from S3

In our AWS Glue Spark jobs, we worked with these tables just like any other tables in the Glue Data Catalog.

Optimising and maintaining Apache Iceberg tables

Some regular maintenance is recommended for your Apache Iceberg tables, primarily compaction (OPTIMIZE) and snapshot clean-up (VACUUM).

Apache Iceberg’s OPTIMIZE statement offers significant performance gains for Athena queries against your data lake on Amazon S3. As data accumulates in your Iceberg tables, query efficiency can degrade due to the growing number of data files. OPTIMIZE combats this by restructuring your data files: it can rewrite small files into larger, more efficiently readable ones or, conversely, split oversized files into smaller, more manageable ones. Without altering your data’s content, this optimisation process ensures faster query execution by reducing Athena’s processing overhead when accessing the data.

OPTIMIZE your_db.your_table REWRITE DATA USING BIN_PACK [WHERE predicate]

The compaction action is charged by the amount of data scanned during the rewrite process. The REWRITE DATA action uses predicates to select files that contain matching rows. If any row in the file matches the predicate, the file is selected for optimisation. Thus, you can specify a WHERE clause to control the number of files affected by the compaction operation.
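
For example, to compact only the files that contain rows matching a predicate on the sample table above (the predicate is purely illustrative):

OPTIMIZE your_db.your_table REWRITE DATA USING BIN_PACK WHERE surname = 'Doe'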

VACUUM

Running the VACUUM statement on Iceberg tables is recommended to remove snapshots that are no longer needed and the data files they reference, reducing metadata size and storage consumption.

Note that because the VACUUM statement makes API calls to Amazon S3, charges apply for the associated requests to Amazon S3.

VACUUM your_db.your_table

You can control the maximum age of snapshots retained by VACUUM with the following command (the value is in seconds; 259200 seconds is three days).

ALTER TABLE your_db.your_table SET TBLPROPERTIES ('vacuum_max_snapshot_age_seconds'='259200')
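
If you also want to guarantee that a minimum number of snapshots is retained regardless of age, Athena exposes a companion table property; the value below is purely illustrative:

ALTER TABLE your_db.your_table SET TBLPROPERTIES ('vacuum_min_snapshots_to_keep'='5')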

Conclusion

We implemented Apache Iceberg on AWS as part of our data lake to provide ACID transactions, time travel, versioning and schema evolution. The integration with the Glue Data Catalog, Glue Spark jobs and Athena was seamless. Regularly compacting partitions on S3 improved performance on our Apache Iceberg tables, especially where we had many small files.
By implementing Apache Iceberg, we achieved all of these objectives with minimal effort.

Managing data lakes can feel like wrangling a wild beast: data constantly evolves, queries struggle to keep pace, and ensuring data consistency across concurrent operations can be a nightmare.

Riaan Adriaanse is a Solution Architect and data engineer. Riaan has been working on AWS implementations since 2010 and has focused on data & analytics since 2014.