Managing data lakes can feel like wrangling a wild beast: data constantly evolves, queries struggle to keep pace, and ensuring data consistency across concurrent operations can be a nightmare. We were facing exactly these challenges in a client's data lake.
Defining a table as an Apache Iceberg table in Athena is as simple as setting the table_type property to 'iceberg'.
CREATE TABLE your_db.your_table (
  name string,
  surname string
)
LOCATION 's3://s3bucket/s3folder/'
TBLPROPERTIES (
  'table_type'='iceberg'
);
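Iceberg tables can also be partitioned at creation time, including with hidden partition transforms such as day(). A minimal sketch, assuming a hypothetical created_at timestamp column and a separate S3 prefix:
CREATE TABLE your_db.your_partitioned_table (
  name string,
  surname string,
  created_at timestamp
)
PARTITIONED BY (day(created_at))
LOCATION 's3://s3bucket/s3folder_partitioned/'
TBLPROPERTIES (
  'table_type'='iceberg'
);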
Once your table is created, you can insert, update and delete records as follows.
INSERT INTO your_db.your_table (name, surname) VALUES ('John', 'Doe');
UPDATE your_db.your_table SET name = 'Johny' WHERE name = 'John' AND surname = 'Doe';
DELETE FROM your_db.your_table WHERE name = 'Johny' AND surname = 'Doe';
To time travel to a previous point in time, use a FOR TIMESTAMP AS OF clause:
SELECT * FROM your_db.your_table FOR TIMESTAMP AS OF (current_timestamp - interval '1' day)
In Apache Iceberg, every change to the data in a table creates a new version called a snapshot.
To execute a version travel query (that is, to view a consistent snapshot as of a specified version), reference the snapshot ID:
SELECT * FROM your_db.your_table FOR VERSION AS OF version
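To find the snapshot IDs (versions) available for a table, you can query Iceberg's metadata through Athena. A minimal sketch using the $history metadata table:
-- Each row is a snapshot; snapshot_id is the value to use in FOR VERSION AS OF
SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
FROM "your_db"."your_table$history"
ORDER BY made_current_at DESC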
Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data.
DROP TABLE your_db.your_table -- this also deletes the table's data from S3
In our AWS Glue Spark jobs, we worked with these tables just like any other tables in the Glue Data Catalog.
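As an illustration, a minimal Spark SQL sketch of the same kind of access, assuming the Glue job has the Iceberg connector enabled and the Glue Data Catalog registered in the Spark session under the hypothetical catalog name glue_catalog:
-- Query the Iceberg table from a Glue Spark job
SELECT name, surname
FROM glue_catalog.your_db.your_table
WHERE surname = 'Doe';
-- DML works the same way from Spark
DELETE FROM glue_catalog.your_db.your_table
WHERE name = 'Johny' AND surname = 'Doe';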
To keep queries fast, it is recommended that you perform some regular maintenance on your Apache Iceberg tables.
Athena's OPTIMIZE statement for Apache Iceberg tables offers significant performance gains for queries on your data lake. As data accumulates in your Iceberg tables, query efficiency can degrade due to the growing number of data files. OPTIMIZE combats this by restructuring your data files: it can rewrite small files into larger, more efficiently readable ones or, conversely, split oversized files into smaller, more manageable ones. Without altering your data's content, this optimisation reduces the processing overhead Athena incurs when accessing the data, ensuring faster query execution.
OPTIMIZE your_db.your_table REWRITE DATA USING BIN_PACK [WHERE predicate]
The compaction action is charged by the amount of data scanned during the rewrite process. The REWRITE DATA action uses predicates to select files that contain matching rows. If any row in the file matches the predicate, the file is selected for optimisation. Thus, you can specify a WHERE clause to control the number of files affected by the compaction operation.
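For example, a sketch that limits compaction to the rows matching one surname (on a real table you would more likely filter on a partition column):
OPTIMIZE your_db.your_table REWRITE DATA USING BIN_PACK WHERE surname = 'Doe'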
Running the VACUUM statement on Iceberg tables is recommended to remove data files that are no longer needed and to reduce metadata size and storage consumption; in Athena, VACUUM performs snapshot expiration and orphan file removal.
Note that because the VACUUM statement makes API calls to Amazon S3, charges apply for the associated requests to Amazon S3.
VACUUM your_db.your_table
You can control how old snapshots must be before VACUUM removes them by setting the following table property.
ALTER TABLE your_db.your_table SET TBLPROPERTIES ('vacuum_max_snapshot_age_seconds'='259200') -- 259,200 seconds = 3 days
We implemented Apache Iceberg on AWS as part of our data lake to provide ACID transactions, time travel, versioning, and schema evolution. The integration with the Glue Data Catalog, Glue Spark jobs, and Athena was seamless. Regularly compacting partitions on S3 helped to improve performance on our Apache Iceberg tables, especially where there were many small files.
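Schema evolution, for instance, is a single DDL statement in Athena. A minimal sketch adding a hypothetical column to the example table:
ALTER TABLE your_db.your_table ADD COLUMNS (birthdate date)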
By implementing Apache Iceberg, we achieved all of these objectives with minimal effort, and the data lake stopped feeling like a wild beast to wrangle.
Riaan Adriaanse is a Solution Architect and data engineer. He has been working on AWS implementations since 2010 and has focused on data & analytics since 2014.