Big Data

Smarter data solutions built on best practices, combining the strengths of data lakes, data marts and data warehouses for the flow, integration, processing, preparation and analysis of data, delivering value-driven insights.

DATA FLOW

PySpark development within a Lambda- or Kappa-style architecture, supporting both event-based streaming and batch processing. Technologies include Kafka, Spark Streaming and IoT data feeds. ACID transactions are provided by Delta Lake (delta.io), Apache Iceberg and Apache Hudi.
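As a rough illustration of the Kappa idea, the sketch below runs one processing function over both replayed history and newly arriving events, rather than keeping separate batch and streaming code paths. It is plain Python, not PySpark or Kafka code; the event shape, the device names and the `update_totals` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (device_id, reading)
Event = Tuple[str, int]


def update_totals(totals: Dict[str, int], events: Iterable[Event]) -> Dict[str, int]:
    """Fold events into running per-device totals.

    The same function serves replayed history and live events,
    which is the core of a Kappa-style design.
    """
    for device_id, reading in events:
        totals[device_id] += reading
    return totals


# Historical events, standing in for a log replayed from its earliest offset.
history = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 2)]
# Newly arriving events, handled by the same code path.
live = [("sensor-b", 1), ("sensor-a", 4)]

totals: Dict[str, int] = defaultdict(int)
update_totals(totals, history)  # "batch" is just a replay of the log
update_totals(totals, live)     # streaming continues where the replay ended
result = dict(totals)
print(result)
```

A Lambda-style design would instead keep a separate batch layer and speed layer and merge their outputs at query time.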

DISCOVERY & GOVERNANCE

Metadata management, data lineage, and schema management, with masking and column- and row-level security, leveraging AWS Glue to automatically extract, organize, and govern metadata across various data sources.
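To make column masking and row-level security concrete, here is a minimal sketch in plain Python. It does not use the AWS Glue or Lake Formation APIs; the `region` and `email` columns, the `mask_email` rule and the `apply_policy` helper are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def mask_email(value: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"


def apply_policy(
    rows: List[Row],
    allowed_region: str,
    masked_columns: Dict[str, Callable[[str], str]],
) -> List[Row]:
    """Return only rows the caller may see, with sensitive columns masked."""
    # Row-level security: drop rows outside the caller's region.
    visible = [row for row in rows if row["region"] == allowed_region]
    # Column-level masking: transform values in the listed columns.
    return [
        {col: masked_columns[col](val) if col in masked_columns else val
         for col, val in row.items()}
        for row in visible
    ]


rows = [
    {"name": "Ann", "email": "ann@example.com", "region": "EU"},
    {"name": "Bob", "email": "bob@example.com", "region": "US"},
]
eu_view = apply_policy(rows, "EU", {"email": mask_email})
print(eu_view)
```

In a managed service the same two rules would typically be declared as policies on the catalog rather than applied in application code.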

DATA QUALITY & STEWARDSHIP

Data preparation, stewardship and measurement of data quality dimensions (e.g. completeness, accuracy) using open-source tools (e.g. Deequ) and custom frameworks, alongside AWS Glue DataBrew and Amazon SageMaker Unified Studio.
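The sketch below shows how two such dimensions might be scored in a small custom framework. It is plain Python and does not reproduce the Deequ API. Accuracy normally needs a trusted reference to compare against, so a simple range rule stands in for it here; the `age` column, the sample records and both helper names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[int]]


def completeness(records: List[Record], column: str) -> float:
    """Fraction of records in which the column is present and not null."""
    if not records:
        return 1.0  # assumption: an empty dataset is treated as complete
    non_null = sum(1 for r in records if r.get(column) is not None)
    return non_null / len(records)


def compliance(records: List[Record], column: str,
               predicate: Callable[[int], bool]) -> float:
    """Fraction of non-null values that satisfy a business rule."""
    values = [r[column] for r in records if r.get(column) is not None]
    if not values:
        return 1.0  # assumption: nothing to violate the rule
    return sum(1 for v in values if predicate(v)) / len(values)


records = [{"age": 34}, {"age": None}, {"age": -5}, {"age": 51}]

age_completeness = completeness(records, "age")
age_compliance = compliance(records, "age", lambda v: 0 <= v <= 120)
print(age_completeness, age_compliance)
```

Scores like these can be compared against agreed thresholds so that a pipeline run fails, or raises a stewardship task, when quality drops.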

DATA LAKE

Separation of data storage and compute with schema-on-read services: Lake Formation, Glue, Hive, Athena and S3. Data is stored as Parquet files or Iceberg tables.
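Schema on read means the raw data is stored as-is and a schema is applied only when it is queried. A minimal sketch of that idea in plain Python follows; the raw lines, the `ORDER_SCHEMA` mapping and the `read_with_schema` helper are hypothetical and stand in for what a query engine does against files in object storage.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# Raw records stored untouched, as they might sit in object storage.
RAW_LINES = [
    '{"id": "1", "amount": "19.99", "note": "first"}',
    '{"id": "2", "amount": "5"}',
]

# The schema lives with the reader, not with the stored data.
ORDER_SCHEMA: Dict[str, Callable[[Any], Any]] = {"id": int, "amount": float}


def read_with_schema(
    lines: Iterable[str], schema: Dict[str, Callable[[Any], Any]]
) -> Iterator[Dict[str, Any]]:
    """Parse each raw line and project/cast it to the requested schema."""
    for line in lines:
        raw = json.loads(line)
        # Columns outside the schema are ignored; missing ones become None.
        yield {col: cast(raw[col]) if col in raw else None
               for col, cast in schema.items()}


orders = list(read_with_schema(RAW_LINES, ORDER_SCHEMA))
print(orders)
```

Because the stored data never changes, a different consumer can read the same lines with a different schema, for example one that also keeps the `note` field.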

DATA MINING & REPORTING

OLAP (Online Analytical Processing) based data marts and warehouses, plus NoSQL solutions. Redshift and Apache Kylin for analytical workloads; OLTP (Online Transaction Processing) using Aurora. QuickSight for reporting. SQL analytics.

SECURITY

IAM, AWS Lake Formation, KMS, S3 security features (bucket policies, ACLs, Block Public Access), Amazon Macie and encryption of data at rest and in transit.
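As one example of enforcing encryption in transit with a bucket policy, the sketch below builds a policy document that denies any S3 request not made over TLS, using the `aws:SecureTransport` condition key. The bucket name and the `build_tls_only_policy` helper are hypothetical, and the code only generates JSON; it does not call AWS.

```python
import json


def build_tls_only_policy(bucket_name: str) -> str:
    """Return a bucket policy (as JSON) that denies non-TLS requests."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # Matches requests that did not arrive over HTTPS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Hypothetical bucket name, for illustration only.
policy_json = build_tls_only_policy("example-data-lake-bucket")
print(policy_json)
```

Encryption at rest is handled separately, for example through default bucket encryption with a KMS key.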