
Simplifying Data Streaming from Kafka to S3 using AWS Lambda

In the ever-evolving data management landscape, the need to seamlessly integrate streaming data into existing batch-processing pipelines has become increasingly critical. One of our clients recently encountered this challenge when they decided to venture into event-based technologies to capture data from their point-of-sale systems.

Exploring Potential Solutions

1. AWS Lambda:

Leveraging AWS Lambda, we can effortlessly add an event trigger to the Kafka topic, similar to an S3 event trigger. This enables incoming events to be batched by message count or by a predefined time window, then processed in the language of your choice.

2. AWS Glue with Spark Streaming:

AWS Glue Spark Streaming offers another avenue. A dedicated Glue job runs continuously without any timeout restrictions, polling the Kafka topic and gathering data in batches. With Spark Streaming capabilities, ETL operations can be performed on the collected data before it lands in S3 in the desired format.

3. EKS / EC2 with Custom Client:

Alternatively, hosting an EKS or EC2 instance dedicated to data collection allows the development of a custom Kafka client. This client collects data from the Kafka topic and deposits it in S3 in the preferred format.

Our Chosen Solution and Justification

After careful consideration, we opted for the Lambda approach for several compelling reasons:

– Serverless Nature: Lambda embodies serverless technology, requiring minimal configuration and setup. Its event-driven architecture aligns seamlessly with our client’s data ingestion process requirements.

Pros of Lambda Approach:

Scalability: Lambda scales effortlessly to accommodate varying data throughput, ensuring optimal performance under fluctuating workloads.

Simplified Configuration: Minimal setup overhead translates to reduced deployment complexities and faster implementation.

Resource Efficiency: Lambda consumes compute only while events are being processed, which keeps resource usage low and optimises cost-effectiveness.

Seamless Integration: The Lambda approach seamlessly integrates with the client’s existing ETL process, minimizing disruption and ensuring continuity.

Low Maintenance Overhead: Lambda’s managed service architecture mitigates support overhead, requiring minimal ongoing maintenance.

Cons of Lambda Approach:

Potential Timeout Issues: Lambda functions may encounter timeouts if processing exceeds the 15-minute threshold, necessitating careful optimization of processing tasks.

Additional Complexity for Customisation: Incorporating additional modules for custom code may introduce complexity, for example requiring extra Lambda layers to be added to the architecture to package those dependencies.

In contrast, while AWS Glue with Spark Streaming and EKS/EC2 options present viable alternatives, they have some challenges and considerations. AWS Glue’s continuous job execution model may lead to inefficiencies during periods of data inactivity, necessitating custom cost-saving mechanisms. Similarly, opting for EKS/EC2 entails the development of custom clients, potentially resulting in higher development costs and increased support overhead.

In conclusion, the Lambda approach emerged as the optimal solution, offering a balance of simplicity, scalability, and cost-effectiveness while seamlessly integrating with the client’s existing infrastructure. By leveraging the power of AWS Lambda, our client can streamline their data streaming processes, unlocking new possibilities for real-time analytics and insights.

The implementation is explained in the sections that follow.

Architecture

Prerequisites

  • If the Kafka Topic resides in another account within the same AWS organization, a VPC peering connection must be configured to establish connectivity.
  • Credentials provided by the Kafka instance administrator are necessary to enable connection to the cluster.
  • The creation of a Topic is essential to enable consumption from the Kafka cluster.
  • If Transport Layer Security (TLS) is necessary, the Kafka cluster administrator must generate a certificate to ensure secure communication.

Setup Secret Manager Key

Create a new secret in Secrets Manager to store the credentials supplied by the Kafka administrator for your Lambda. Use username and password as the key names, with the corresponding values provided by the administrator.
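
As a rough sketch, the same secret could also be created programmatically with boto3. The secret name and credential values below are placeholders, not values from the original setup:

import json
import boto3

secrets_client = boto3.client('secretsmanager')

# Placeholder name and credentials; substitute the values supplied by your Kafka administrator
secrets_client.create_secret(
    Name='kafka/consumer-credentials',
    SecretString=json.dumps({
        'username': 'your-kafka-username',
        'password': 'your-kafka-password'
    })
)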

Optional:

If Transport Layer Security (TLS) is enabled on your Kafka topic, an additional step is required to ensure secure communication between your Lambda function and the Kafka cluster.

Create a new secret in Secrets Manager and copy the content of the .pem certificate supplied by your Kafka administrator into the value of the key, using the following format:

{"certificate":"-----BEGIN CERTIFICATE-----XXXXXXX-----END CERTIFICATE-----"}

Lambda Setup

Create a new Lambda in the AWS console, and select your preferred development language. For the purposes of this guide, I will be using Python. Next, select an appropriate role that has the required permissions to do the following (a sample policy sketch follows the list):

  • Execute Lambdas
  • Write objects to S3
  • Read from Secrets Manager
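
As an illustrative sketch only, and not a hardened policy, the permissions above could translate into an IAM policy along these lines. The policy name, bucket name, region, account ID, and secret ARN are placeholders:

import json
import boto3

iam = boto3.client('iam')

# Placeholder resources; basic execution (CloudWatch Logs) and VPC access are typically covered
# by also attaching the AWSLambdaBasicExecutionRole and AWSLambdaVPCAccessExecutionRole managed policies
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        },
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": "arn:aws:secretsmanager:eu-west-1:111111111111:secret:kafka/*"
        }
    ]
}

iam.create_policy(
    PolicyName='kafka-to-s3-lambda-policy',
    PolicyDocument=json.dumps(policy_document)
)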

Update your Lambda configuration (such as memory and timeout) appropriately; a small sketch of doing this programmatically follows:
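
The function name, memory size, and timeout below are illustrative values rather than recommendations:

import boto3

lambda_client = boto3.client('lambda')

# Illustrative values only; size memory and timeout to match your batch sizes and processing needs
lambda_client.update_function_configuration(
    FunctionName='kafka-to-s3-consumer',
    MemorySize=256,
    Timeout=120
)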

Add the Kafka Trigger to your Lambda:

  • Batch Size: The batch size refers to the number of offsets (messages) you wish to consume in a single batch.
  • Starting Position: This setting determines where the consumer will start consuming messages from. Selecting “latest offset” ensures consumption begins with the newest message on the topic, while choosing “earliest” retrieves all data residing on the topic.
  • Batch Window: In cases where the specified batch size condition isn’t met, the consumer will collect messages in timed windows, such as every 60 seconds.
  • Topic Name: Provided by your Kafka Administrator, this specifies the name of the Kafka topic from which data will be consumed.
  • Consumer Group: This associates the Kafka consumer with your Lambda function, allowing Kafka to keep track of its offsets.
  • VPC: Include the Virtual Private Cloud (VPC) where your VPC peering connection is configured. This ensures secure communication between your Lambda function and the Kafka cluster.
  • Authentication: Various authentication methods may be available, supplied by your Kafka Administrator to authenticate the Lambda function with the Kafka cluster.
  • Secret Manager Key: Add the ARN of the secret you created in the previous step that holds the credentials of your Kafka topic.

Encryption:

If Transport Layer Security (TLS) is enabled, this field holds the certificate information provided by your Kafka Administrator and ensures secure communication between the Lambda function and the Kafka cluster. Add the ARN of the Secrets Manager secret you created in the optional step above. A sketch of the full trigger configuration follows below.
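
Putting the settings above together, creating the trigger programmatically might look roughly like the sketch below. The function name, topic, consumer group, broker endpoint, secret ARNs, subnet, and security group IDs are all placeholders, the authentication type depends on what your Kafka Administrator provides, and the certificate entry only applies when TLS is enabled:

import boto3

lambda_client = boto3.client('lambda')

# All names, ARNs, and endpoints below are placeholders
lambda_client.create_event_source_mapping(
    FunctionName='kafka-to-s3-consumer',
    Topics=['YourTopicName'],
    BatchSize=100,                        # messages to consume in a single batch
    MaximumBatchingWindowInSeconds=60,    # batch window used when the batch size is not reached
    StartingPosition='LATEST',            # use 'TRIM_HORIZON' to start from the earliest offset
    SelfManagedEventSource={
        'Endpoints': {'KAFKA_BOOTSTRAP_SERVERS': ['broker-1.example.com:9092']}
    },
    SelfManagedKafkaEventSourceConfig={'ConsumerGroupId': 'kafka-to-s3-consumer-group'},
    SourceAccessConfigurations=[
        # Credentials secret created earlier; the auth type depends on your cluster setup
        {'Type': 'SASL_SCRAM_512_AUTH',
         'URI': 'arn:aws:secretsmanager:eu-west-1:111111111111:secret:kafka/consumer-credentials'},
        # Only required when TLS is enabled: the certificate secret created earlier
        {'Type': 'SERVER_ROOT_CA_CERTIFICATE',
         'URI': 'arn:aws:secretsmanager:eu-west-1:111111111111:secret:kafka/ca-certificate'},
        # Networking for the VPC where your peering connection is configured
        {'Type': 'VPC_SUBNET', 'URI': 'subnet:subnet-0123456789abcdef0'},
        {'Type': 'VPC_SECURITY_GROUP', 'URI': 'security_group:sg-0123456789abcdef0'},
    ]
)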

This concludes the setup required for your Lambda. Once you activate the trigger, your Lambda will start polling your Kafka topic for new messages.

The code:
This is a sample of code that collects the messages from the Kafka topic in the event payload and stores them as a file on S3.

import json
import boto3
import time
import base64

s3 = boto3.client('s3')

def lambda_handler(event, context):

    # 'records' is keyed by topic and partition, e.g. 'YourTopicName-0'
    messages = []

    # Loop through all messages in the event, across every partition
    for records in event['records'].values():
        for record in records:
            # Messages arrive base64-encoded; decode and parse the JSON value
            try:
                message_json = json.loads(base64.b64decode(record['value']))
                messages.append(message_json)
            except Exception as e:
                print(str(e))

    # Write messages as JSON to S3, using a timestamp so each batch lands in its own file
    s3.put_object(
        Body=json.dumps(messages),
        Bucket='your-bucket-name',
        Key=f'your-file-name-{int(time.time())}.json'
    )
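
For a quick local test, a hypothetical event shaped like the payload Lambda delivers for Kafka could be fed to the handler above. The topic name, offset, and message content are made up, and the S3 call still needs valid AWS credentials and an existing bucket:

import base64
import json

# Hypothetical sample event: 'records' is keyed by 'topic-partition' and values are base64-encoded
sample_event = {
    "records": {
        "YourTopicName-0": [
            {
                "topic": "YourTopicName",
                "partition": 0,
                "offset": 15,
                "value": base64.b64encode(json.dumps({"sale_id": 1, "amount": 9.99}).encode()).decode()
            }
        ]
    }
}

lambda_handler(sample_event, None)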

Conclusion

I found this project incredibly rewarding as it compelled me to delve into the intricacies of Kafka from the basics. This implementation represents a straightforward approach to building a streaming solution from Kafka. Its simplicity offers a solid foundation that can easily accommodate additional functionality tailored to your needs. I trust you found this tutorial and explanation valuable. Here’s to fruitful development endeavours ahead!

Elbert de Wet is a certified Big Data and Analytics engineer. Elbert has been working on AWS Big Data solution implementations since 2022.