Big Data
Smarter data solutions built on best practices, combining the strengths of data lakes, data marts and data warehouses to support the flow, integration, processing, preparation and analysis of data for value-driven insights.
DATA FLOW
PySpark development within a Lambda- or Kappa-style architecture, supporting both event-based streaming and batch processing. Technologies include Kafka, Spark Streaming and IoT. ACID transactions with Delta Lake (delta.io) and Apache Hudi.
DISCOVERY & GOVERNANCE
Metadata management, data lineage and schemas. Data masking, plus column- and row-level security.
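As a rough illustration of masking and row-level security, here is a minimal Python sketch. The function names, the `region` field and the sample records are hypothetical and chosen only for the example; in practice these controls are usually enforced by the platform (for instance in the lake or warehouse layer) rather than in application code.

```python
from typing import Dict, Iterable, List, Set


def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of a sensitive value."""
    keep = value[-visible:] if visible > 0 else ""
    return mask_char * (len(value) - len(keep)) + keep


def filter_rows(rows: Iterable[Dict[str, str]],
                allowed_regions: Set[str]) -> List[Dict[str, str]]:
    """Row-level security: return only rows whose region the caller may see."""
    return [row for row in rows if row.get("region") in allowed_regions]


# Hypothetical sample data, for illustration only.
records = [
    {"region": "EU", "card": "4111111111111111"},
    {"region": "US", "card": "5500000000000004"},
]

# Apply the row filter first, then mask the sensitive column.
visible_rows = [
    {**row, "card": mask_value(row["card"])}
    for row in filter_rows(records, allowed_regions={"EU"})
]
```

Here `visible_rows` contains only the EU record, with every card digit except the last four replaced by the mask character.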
DATA QUALITY & STEWARDSHIP
Data preparation, stewardship and data quality dimensions (e.g. completeness, accuracy) using open-source tools (e.g. Deequ), custom frameworks, AWS Glue DataBrew and Talend.
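To make the completeness dimension concrete, the sketch below computes it in plain Python as the fraction of records with a usable value in a given column. The `completeness` helper and the sample records are hypothetical; treating empty strings as missing and an empty dataset as fully complete are assumptions made for this example, and tools such as Deequ offer comparable checks at scale.

```python
from typing import Any, Iterable, Mapping


def completeness(rows: Iterable[Mapping[str, Any]], column: str) -> float:
    """Fraction of rows where `column` is present and neither None nor empty.

    Assumption: an empty dataset is reported as fully complete (1.0).
    """
    rows = list(rows)
    if not rows:
        return 1.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


# Hypothetical sample records, for illustration only.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": ""},
    {"id": 4, "email": "d@example.com"},
]

email_completeness = completeness(customers, "email")
```

A score like this can be compared against an agreed threshold (say, 0.95) to decide whether a batch is fit to publish.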
DEVOPS
Full release and versioning with source control, containers and documentation of the environment setup, including the business-as-usual (BAU) steps needed to maintain the environment.
DATA LAKE
Separation of data and compute with schema-on-read services. Lake Formation, Glue, NiFi, Talend, Hive, Presto, Athena, HBase, S3. Storage in Parquet or ORC.
DATA MINING & REPORTING
OLAP (Online Analytical Processing) based data marts, warehouses and NoSQL solutions. Redshift and Apache Kylin. OLTP using Aurora. QuickSight for reporting. SQL analytics.
SECURITY
AWS Cognito, Google Auth, JSON Web Tokens (JWT), SAML and LDAP.
AI/ML
Modelling, natural language, AI and ML services built on AWS platform components, using tools such as SageMaker, Comprehend, Jupyter/Zeppelin and DataBrew.