Dynamic data querying

DuckDB

AWS Lambda

AWS S3

Serverless

Streamlining AWS Lambda with DuckDB for Dynamic Data Handling

by: Jerrish Varghese

January 08, 2024

In the world of serverless computing, AWS Lambda has emerged as a powerful solution for running code in response to triggers without the need to provision or manage servers. Combining Lambda with DuckDB, an in-process SQL database, opens up intriguing possibilities for data-intensive applications. This article delves into a practical scenario where Lambda and DuckDB are leveraged to execute SQL queries dynamically based on incoming requests, showcasing flexibility and efficiency in data handling.

Scenario Overview

Imagine a system where data stored in S3 needs to be queried dynamically based on parameters provided at runtime. AWS Lambda comes into play by fetching the data, loading it into DuckDB, executing the SQL query, and returning the results. This setup is particularly useful for applications requiring ad-hoc query capabilities over datasets without the overhead of a traditional database system.

Key Components and Setup

AWS Lambda: A serverless compute service running code in response to triggers.
DuckDB: An in-process SQL OLAP database management system, renowned for its high performance and ease of integration.
AWS S3: A scalable storage service used to store and retrieve the dataset.

Implementing the Solution

Step 1: Preparing the AWS Lambda Function

The Lambda function is designed to accept parameters such as file_name, sql, and bind_params, determining the file to fetch from S3, the SQL query to execute, and the parameters to bind to the query, respectively.

def lambda_handler(event, context):
    file_name = event.get('file_name')
    sql = event.get('sql')
    bind_params = event.get('bind_params', {})

Step 2: Fetching Data from AWS S3

Using the Boto3 library, the Lambda function retrieves the specified file from S3, storing it temporarily for DuckDB to access.

s3 = boto3.client('s3')
bucket_name = os.getenv('BUCKET_NAME')
local_file_name = f"/tmp/{file_name}"
s3.download_file(bucket_name, file_name, local_file_name)

Step 3: Querying Data with DuckDB

The function then connects to DuckDB, loads the JSON data, and executes the provided SQL query. DuckDB's ability to directly query JSON files simplifies data loading and querying processes.

conn = duckdb.connect(database=':memory:', read_only=False)
prepared_sql = sql % local_file_name  # Embed the file path into the SQL query
results = conn.execute(prepared_sql, parameters=list(bind_params.values())).fetchall()

Step 4: Handling the Response

The function formats the query results, ensuring dates are serialized in a human-readable format (dd-MM-yyyy), and returns the formatted data.

def serialize_dates(row):
    for key, value in row.items():
        if isinstance(value, datetime.date):
            row[key] = value.strftime('%d-%m-%Y')
    return row

fetched_results = [serialize_dates(row) for row in fetched_results]

Step 5: Deploying to Production

After thorough testing, the Lambda function is deployed to production, providing a scalable and efficient solution for querying JSON data stored in S3 on-the-fly.

Conclusion

Integrating AWS Lambda with DuckDB offers a compelling solution for dynamic data querying and handling, demonstrating the power and flexibility of serverless architectures. This approach simplifies infrastructure management while providing fast and efficient data processing capabilities, making it ideal for a broad spectrum of applications, from analytics to data transformation efforts.

By embracing serverless computing and leveraging cutting-edge database technologies, developers can construct scalable, efficient, and highly adaptable data processing pipelines. Such innovations not only streamline operations but also pave the way for new possibilities in data handling and analysis.

For those interested in implementing this solution or exploring the code further, you can find the complete code example on GitHub: AWS Lambda with DuckDB - Example Code.

Embracing serverless architectures and modern database solutions like DuckDB enables developers to focus more on delivering value and less on managing infrastructure, marking a significant step forward in the evolution of data processing and management.

Uploading and Retrieving Files securely from Azure Blob Storage using SAS token

Sangeetha Saha

March 13, 2024

Empowering Women, Empowering Futures - Inspiring Women's Journeys

Sachin Biju, Anisha Mariam Thomas, Sangeeta Saha, Nancy Alfred

March 08, 2024

Unraveling the Java vs Python Dilemma: Which Programming Language Reigns Supreme?

Christina Antony

February 21, 2024

Get started now

Get a quote for your project.