Leveraging AWS Lambda, we can effortlessly add an event trigger to the Kafka topic, similar to an S3 event trigger. This enables the batching of incoming events based on either message count or a predefined time window, with the batches then processed in a language of your choice.
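For illustration, those two batching controls correspond to the BatchSize and MaximumBatchingWindowInSeconds settings on the Lambda event source mapping. A minimal boto3 sketch, assuming a mapping already exists (the UUID is a placeholder):

import boto3

lambda_client = boto3.client('lambda')

# Batch by message count (BatchSize) or by time window
# (MaximumBatchingWindowInSeconds), whichever fills first
lambda_client.update_event_source_mapping(
    UUID='your-mapping-uuid',           # placeholder
    BatchSize=100,                      # max messages per invocation
    MaximumBatchingWindowInSeconds=60,  # max seconds to wait before invoking
)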
AWS Glue Spark Streaming offers another avenue. A dedicated Glue job runs continuously without any timeout restrictions, polling the Kafka topic and gathering data in batches. With Spark Streaming capabilities, ETL operations can be performed on the collected data before landing it in S3 in the desired format.
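As a rough sketch of that pattern, a Glue streaming job's script could use standard PySpark Structured Streaming (the brokers, topic, and bucket names below are placeholders; security options are omitted for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('kafka-to-s3').getOrCreate()

# Continuously poll the Kafka topic as a streaming DataFrame
df = (spark.readStream
      .format('kafka')
      .option('kafka.bootstrap.servers', 'broker1:9092')  # placeholder
      .option('subscribe', 'your-topic-name')             # placeholder
      .load())

# Kafka values arrive as bytes; cast to string before any ETL
events = df.selectExpr('CAST(value AS STRING) AS value')

# Micro-batch the stream into S3 as JSON files
query = (events.writeStream
         .format('json')
         .option('path', 's3://your-bucket-name/landing/')              # placeholder
         .option('checkpointLocation', 's3://your-bucket-name/checkpoints/')
         .start())
query.awaitTermination()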
Alternatively, hosting an EKS or EC2 instance dedicated to data collection allows the development of a custom Kafka client. This client collects data from the Kafka topic and deposits it in S3 in the preferred format.
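A custom client along those lines, running on EC2 or EKS, might look like the following sketch. It assumes the kafka-python package; the topic, brokers, group id, and bucket are placeholders, security settings are omitted, and a production client would batch messages rather than writing one object each:

import json
import boto3
from kafka import KafkaConsumer

s3 = boto3.client('s3')

# Long-running consumer loop polling the topic
consumer = KafkaConsumer(
    'your-topic-name',                    # placeholder
    bootstrap_servers=['broker1:9092'],   # placeholder
    group_id='s3-sink',                   # placeholder
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    # One object per message, keyed by partition and offset
    key = f'landing/{message.partition}-{message.offset}.json'
    s3.put_object(Body=json.dumps(message.value), Bucket='your-bucket-name', Key=key)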
After careful consideration, we opted for the Lambda approach for several compelling reasons:
– Serverless Nature: Lambda embodies serverless technology, requiring minimal configuration and setup. Its event-driven architecture aligns seamlessly with our client’s data ingestion process requirements.
– Scalability: Lambda scales effortlessly to accommodate varying data throughput, ensuring optimal performance under fluctuating workloads.
– Simplified Configuration: Minimal setup overhead translates to reduced deployment complexities and faster implementation.
– Resource Efficiency: Lambda consumes resources only while processing events, so costs stay proportional to actual usage.
– Seamless Integration: The Lambda approach seamlessly integrates with the client’s existing ETL process, minimizing disruption and ensuring continuity.
– Low Maintenance Overhead: Lambda’s managed service architecture mitigates support overhead, requiring minimal ongoing maintenance.
However, we also weighed the approach's limitations:
– Potential Timeout Issues: Lambda functions may encounter timeouts if processing exceeds the 15-minute limit, necessitating careful optimization of processing tasks.
– Additional Complexity for Customisation: external modules needed for custom code must be packaged and attached as Lambda layers, adding a step to the architecture.
In contrast, while AWS Glue with Spark Streaming and EKS/EC2 options present viable alternatives, they come with their own challenges and considerations. AWS Glue's continuous job execution model may lead to inefficiencies during periods of data inactivity, necessitating custom cost-saving mechanisms. Similarly, opting for EKS/EC2 entails the development of custom clients, potentially resulting in higher development costs and increased support overhead.
In conclusion, the Lambda approach emerged as the optimal solution, offering a balance of simplicity, scalability, and cost-effectiveness while seamlessly integrating with the client’s existing infrastructure. By leveraging the power of AWS Lambda, our client can streamline their data streaming processes, unlocking new possibilities for real-time analytics and insights.
Create a new secret in Secrets Manager to store the credentials supplied by the Kafka administrator for your Lambda. Use username and password as the keys.
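As an illustrative boto3 sketch (the secret name and values are placeholders):

import json
import boto3

secrets = boto3.client('secretsmanager')

# Store the Kafka credentials under the keys the trigger expects
secrets.create_secret(
    Name='kafka/credentials',  # placeholder
    SecretString=json.dumps({
        'username': 'your-kafka-username',
        'password': 'your-kafka-password',
    }),
)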
Optional:
If Transport Layer Security (TLS) is enabled on your Kafka topic, an additional step is required to ensure secure communication between your Lambda function and the Kafka cluster.
Create a new secret in Secrets Manager and copy the content of the .pem certificate supplied by your Kafka administrator into the value of the certificate key, in the following format:
{"certificate":"-----BEGIN CERTIFICATE-----XXXXXXX-----END CERTIFICATE-----"}
Create a new Lambda in the AWS console, and select your preferred development language. For the purposes of this guide, I will be using Python. Next, select an appropriate role that has the required permissions to, at a minimum:
– read the secret(s) you created in Secrets Manager (secretsmanager:GetSecretValue);
– write objects to your destination S3 bucket (s3:PutObject);
– write logs to CloudWatch Logs;
– manage network interfaces (ec2:CreateNetworkInterface, ec2:DescribeNetworkInterfaces, ec2:DeleteNetworkInterface) if your Kafka cluster is only reachable from within a VPC.
A sketch of attaching the first two as an inline policy follows below.
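This is an illustrative sketch only; the role, policy, and resource names are placeholders, and you should scope the resources down for production:

import json
import boto3

iam = boto3.client('iam')

policy = {
    'Version': '2012-10-17',
    'Statement': [
        {'Effect': 'Allow',
         'Action': 'secretsmanager:GetSecretValue',
         'Resource': 'arn:aws:secretsmanager:*:*:secret:kafka/*'},  # placeholder
        {'Effect': 'Allow',
         'Action': 's3:PutObject',
         'Resource': 'arn:aws:s3:::your-bucket-name/*'},            # placeholder
    ],
}

# Attach the permissions as an inline policy on the Lambda's role
iam.put_role_policy(
    RoleName='your-lambda-role',         # placeholder
    PolicyName='kafka-consumer-policy',  # placeholder
    PolicyDocument=json.dumps(policy),
)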
Update your Lambda configuration appropriately; in particular, raise the timeout and memory from their defaults (3 seconds and 128 MB) to values suited to your expected batch sizes.
Add the Kafka trigger to your Lambda. Choose the self-managed Apache Kafka source and supply your bootstrap servers, the topic name, the authentication secret created earlier, and, optionally, the batch size and batching window.
Encryption:
If Transport Layer Security (TLS) is enabled, this field holds the certificate information provided by your Kafka administrator and ensures secure communication between the Lambda function and the Kafka cluster. Add the ARN of the Secrets Manager secret you created in the previous step.
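Pulling the trigger configuration together, here is a hedged boto3 sketch of creating the event source mapping. The function name, brokers, topic, and secret ARNs are placeholders; the BASIC_AUTH type assumes SASL/PLAIN credentials, and the SERVER_ROOT_CA_CERTIFICATE entry is only needed when TLS is enabled:

import boto3

lambda_client = boto3.client('lambda')

lambda_client.create_event_source_mapping(
    FunctionName='your-function-name',  # placeholder
    Topics=['YourTopicName'],           # placeholder
    StartingPosition='LATEST',
    BatchSize=100,
    MaximumBatchingWindowInSeconds=60,
    SelfManagedEventSource={
        'Endpoints': {'KAFKA_BOOTSTRAP_SERVERS': ['broker1:9092']}  # placeholder
    },
    SourceAccessConfigurations=[
        # Credentials secret created earlier
        {'Type': 'BASIC_AUTH',
         'URI': 'arn:aws:secretsmanager:region:account:secret:kafka/credentials'},
        # Certificate secret; only needed when TLS is enabled on the topic
        {'Type': 'SERVER_ROOT_CA_CERTIFICATE',
         'URI': 'arn:aws:secretsmanager:region:account:secret:kafka/ca-certificate'},
    ],
)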
This concludes the setup required for your Lambda. Once you activate the trigger, your Lambda will start polling your Kafka topic for new messages.
The code:
This is a sample of code that collects the messages from the Kafka topic in the event payload and stores them as a file on S3.
import json
import boto3
import time
import base64

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Initialise list to store decoded messages
    messages = []
    # event['records'] is keyed by 'topic-partition' (e.g. 'YourTopicName-0'),
    # so iterate over every partition's batch rather than a hard-coded key
    for records in event['records'].values():
        # Loop through all messages in the batch
        for record in records:
            try:
                # Message values arrive base64-encoded; decode, then parse as JSON
                message_json = json.loads(base64.b64decode(record['value']))
                messages.append(message_json)
            except Exception as e:
                print(str(e))
    # Write messages as JSON to S3 under a timestamped key so each
    # invocation produces a new file instead of overwriting the last one
    s3.put_object(
        Body=json.dumps(messages),
        Bucket='your-bucket-name',
        Key=f'your-prefix/{int(time.time() * 1000)}.json',
    )
I found this project incredibly rewarding as it compelled me to delve into the intricacies of Kafka from the basics. This implementation represents a straightforward approach to building a streaming solution from Kafka. Its simplicity offers a solid foundation that can easily accommodate additional functionality tailored to your needs. I trust you found this tutorial and explanation valuable. Here’s to fruitful development endeavours ahead!
Elbert de Wet is a certified Big Data and Analytics engineer. Elbert has been working on AWS Big Data solution implementations since 2022.