Federated Snowflake database using Athena Federated Query and Join with data in Your AWS Data Lake (Amazon S3)

Does a tool exist that can simplify data virtualization yet is inexpensive to maintain? After all, data virtualization is key to simplifying data analytics and digital transformation. But up until recently, not only has there been a shortage of inexpensive data virtualization platforms, but the existing platforms have also been incredibly complex.

That all seems to be changing. AWS Athena with its Presto engine has gotten very wide acceptability as a mature data lake solution on top of Amazon Simple Storage Service. With Athena, organizations can also manage multiple on-premises and on-cloud data sources. Now, the AWS Athena Query Federation can simplify connecting various data sources and allow Athena to be used for data virtualization.

When Athena SQL is combined with AFQ Connectors, an organization can mix and match data from any source. Not only are SQL joins or aggregates possible, but with CTAS (Create Table as Select) tactical data migration from remote data sources to Athena and S3 can also be achieved. This is very advantageous considering that SQL can be used for not only data mining but for data migration too.

As SQL is involved, it is simpler. And because the data is on S3, it is faster.

For the purpose of this blog and demonstrating Athena AFQ Federation, we will use Trianz Snowflake AFQ Connector. Your Snowflake instance will be running on the cloud infrastructure.

Imagine a situation where Snowflake is your data warehouse, and Athena is your data lake. When you need to unify data from the Snowflake warehouse and the Athena data lake, you would deploy the Trianz Snowflake AFQ Connector available in the AWS Serverless Application Repository (SAR). There are other connectors available for Teradata, Sap Hana, Oracle, Big Query, Cloudera, and others as well.

Therefore, depending upon your requirements, multiple connectors can be configured. The deployment details are more or less like the Snowflake Connector described below. You can always reach out to [email protected] for any queries and feedback.

Let’s start with highlighting the solution and then detailing the steps involved.

Solution overview


Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

AWS Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.

AWS Lambda lets you run code without provisioning or managing servers. You can run code for virtually any type of application with zero administration and only pay for when the code is running.

Snowflake is a widely recognized cloud based data warehouse. Snowflake can be connected using code via its Java Database Connectivity (JDBC) drivers.

Solution overview

In AFQ Federation, Athena will route Snowflake specific queries via Lambda to the target Snowflake instance. Inside the Lambda, JDBC code will make connection and fetch data and route it back to Athena. If the data does not fit into Lambda RAM runtime memory, it will spill the data to S3 ensuring reliable transmission. For snowflake, we have implemented pagination-based approach for parallel processing of data. There is a custom environment variable “pagecount” which defines the number of records per partition.

Number of partitions = Total number of records / pagecount

Based on the number of partitions in a Snowflake Table, Athena will try to load the data by running multiple Lambda in parallel.

The Trianz Snowflake AFQ Connector (and other connectors from Trianz) ensure simple, reliable, and fast transmission for data – two to three times faster than other data virtualization platforms – for joining or aggregating within Athena.

Prerequisites


Before you get started, create a secret for the Snowflake instance with username and password using AWS Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.

  2. Choose Store a new secret.

  3. Choose the Other types of secrets (e.g., API key) radio button

  4. Set the credentials as key/value pairs (username, password) for your Snowflake instance.

  5. For Secret name, enter a name for your secret. Use the prefix using string snowflake so it’s easy to find.

  6. Leave the remaining fields at their defaults and choose Next.

  7. Complete your secret creation.

Prerequisites

Setting up your S3 bucket


On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, use the name athena-accelerator/snowflake.

Setting up your S3 bucket

Configuring Athena federation with Snowflake instance


To configure Athena federation with Snowflake instance, complete the following steps:

  1. On the AWS Serverless Application Repository, choose Available applications.

  2. In the search field, enter TrianzSnowflakeAthenaJDBC..

    Configuring Athena federation with Oracle
  3. In the Application settings section, provide the following details:

  4. Application name - TrianzSnowflakeAthenaJDBC

  5. SecretNamePrefix – trianz-snowflake-athena

  6. SpillBucket – Athena-accelerator/snowflake

  7. JDBCConnectorConfig –
    Format : snowflake://jdbc:snowflake://{snowflake_instance_url}/?warehouse={warehousename}&db={databasename}&schema={schemaname}&${secretname}
    Example : snowflake://jdbc:snowflake://trianz.snowflakecomputing.com/?warehouse=ATHENA_WH&db=ATHENA_DEV&schema=ATHENA&${trianz-snowflake-athena}

  8. DisableSpillEncyption – False

  9. LambdaFunctionName – trsnowflake

  10. SecurityGroupID – Security group ID where Snowflake instance is deployed. Please ensure valid Inbound and Outbound rules are applied based on your connection.

  11. SpillPrefix – Create a folder under the S3 bucket which is created in previous steps and specify the name. Example: athena-spill

  12. Subnetids – Use the subnets where Snowflake instance is running with comma separation. Ensure the subnet is in VPC and has NAT Gateway and Internet Gateway attached.

  13. Select the I acknowledge check box.

  14. Choose Deploy.

Note: By default, connectors have a 7-day license key. To extend the license key or for help on deploying connectors please reach out to [email protected]

Note: Make sure the Your IAM Role and Lambda Role have permissions to access Serverless Application Repository, CloudFormation, S3, CloudWatch, CloudTrail, Secrets Manager, Lambda, Athena

Running federated queries with Athena


To start running federated queries, complete the following steps:

  1. On the Athena console, choose Workgroup with Athena engine version 2. Athena Federated Query is supported only on Athena engine version.

  2. The current engine version for any workgroup can be found in Workgroups page.

  3. Run your queries, using lambda:trsnowflake to run against tables in Snowflake Database.

Running federated queries with Athena

Below is a union all query example of data from table in s3 and table in snowflake.

Below is a union all query example of data from table in s3 and table in snowflake

Athena query performance comparison


Several customers have asked us for performance insights and prescriptive guidance on how queries in Athena compare against federated queries, and how to use them.

In this section, we used the COVID19 standard dataset to run results from Athena with Snowflake connection. This should give you an idea of what to expect when running federated queries against Snowflake Database from the Athena.

The following table summarizes the dataset sizes:

sno

Table

Table Size (Records)

Partition table

Application

Total Time

Throughput

T1COVID19_DATA_20M20546048partitioned table with 300000 records per partitionSnowflake181.61 seconds113132 records/second
T2COVID19_DATA_10M10113024partitioned table with 200000 records per partitionSnowflake18.24 seconds87,695 records/second
T3COVID19_DATA_10M in S3 and COVID19_DATA_10M in Snowfake10113024partitioned table with 200000 recordsS3/ Snowflake26.06 seconds388,066 records/second

We ran the following three tests:

  • T1 – Queries ran from AFQ and retrieved records from the Snowflake instance

  • T2 – Queries ran from AFQ and retrieved records from the Snowflake instance

  • T3 – Queries ran from AFQ and retrieved records from the Snowflake instance and S3 with join query

The following graph represents the performance of some of the queries when run from AWS Athena.

AWS Athena

Key performance best practice considerations


When considering Athena federation with Snowflake Database, you could take into account the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to Snowflake Database. Use filter and limited-range scans in your queries to avoid full table scans.

  • If your SQL query requires returning a large volume of data from Snowflake Database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Snowflake to your Amazon S3 data lake.

  • Snowflake schema which is an extension of Star schema is a used data model in Snowflake Database. In the Snowflake schema model, unload your large fact tables into your data lake and leave the dimension tables in Snowflake Database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your data lake.

  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Snowflake Database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Snowflake Database cluster to benefit from concurrent connections to queue up.

Conclusion


In this post, you learned how to configure and use Athena federation with Snowflake Database using Lambda. Now you don’t need to wait for all the data in your Snowflake Database data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best-practice considerations outlined in the post to minimize the data transferred from Snowflake Database for better performance. When queries are well-written for federation, the performance penalties are negligible, as observed in the TPC-DS benchmark queries in this post.

×

Book a Demo and
FREE Proof of Value


Contact Trianz to schedule a presentation walkthrough of the Athena AFQ solution and
a free PoV.

By submitting your information, you agree to our revised  Privacy Statement.