Federated Teradata database using Athena Federated Query and Join with data in Your AWS Data Lake (Amazon S3)

This post provides guidance on how to configure Amazon Athena federation with AWS Lambda to Teradata while addressing performance considerations to ensure proper use.

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Teradata as your transactional data store and have requirement of joining your data setting in data lake on Amazon S3 with Teradata in cloud or Teradata running on EC2 or with on-prem Teradata database.

  • Team has a data lake in Amazon S3 and uses Athena. They need access to the data either from Teradata in cloud or on prem Teradata database.

  • Analysts using Athena to query their data lake for analytics need agility and flexibility to access data in a Teradata database without moving the data to Amazon S3 Data Lake.

In these scenarios, Athena federation with Athena federation feature allows you to seamlessly access the data in your Teradata database without having to wait to unload the data to the Amazon S3 data lake, which removes the overhead in managing such jobs.

In this post, you walk through a step-by-step configuration to set up Athena federation using Lambda to access data in Teradata db running on premise.

For the purpose of this blog and demonstrating Athena Federation Query, we will use Trianz Teradata Amazon Federation Query Connector. The runtime will include Teradata instance running on premise. Your Teradata instance can be on cloud, EC2 or on-premises. You would deploy the Trianz Teradata Athena Federated query Connector available in the AWS Serverless Application Repository (SAR).

Let’s start with highlighting the solution and then detailing the steps involved.

Solution overview


Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena.

Solution overview

When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

When Athena SQL is combined with Athena federation Query Connectors, an organization can mix and match data from any source. Not only are SQL joins or aggregates possible, but also able to perform ETL using CTAS (Create Table as Select) query.

AWS Lambda lets you run code without provisioning or managing servers. You can run code for virtually any type of application with zero administration and only pay for when the code is running.

Teradata is a widely recognized database and exists as a relational database management system (RDBMS) and Teradata can be connected using code via its Java Database Connectivity (JDBC) drivers.

In Amazon Athena Federation, Athena will route Teradata specific queries via Lambda to the target Teradata instance. Inside the Lambda, JDBC code will make connection and fetch data and route it back to Athena. If the data does not fit into Lambda RAM runtime memory, it will spill the data to S3 ensuring reliable transmission. Based on the number of partitions in a Teradata Table, Athena will try to load the data by running multiple Lambda in parallel.

Prerequisites


Before you get started, create a secret for the Teradata instance with username and password using AWS Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.

  2. Choose Store a new secret.

  3. Choose the Other types of secrets (e.g., API key) radio button

  4. Set the credentials as key/value pairs (username, password) for your Teradata instance.

  5. For Secret name, enter a name for your secret. Use the prefix TeradataAFQ so it’s easy to find.

  6. Leave the remaining fields at their defaults and choose Next.

  7. Complete your secret creation.

Prerequisites

Setting up your S3 bucket


On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, use the name athena-accelerator/teradata.

Setting up your S3 bucket

Configuring Athena federation with Teradata instance


To configure Athena federation with Teradata instance, complete the following steps:

  1. On the AWS Serverless Application Repository, choose Available applications.

  2. In the search field, enter TrianzTeradataAthenaJDBC..

    Configuring Athena federation with Oracle
  3. In the Application settings section, provide the following details:

  4. Application name - TrianzTeradataAthenaJDBC

  5. SecretNamePrefix – TeradataAFQ

  6. SpillBucket – athena-accelerator/teradata

  7. JDBCConnectorConfig –
    Format : teradata://jdbc:teradata://hostname/user=testUser&password=testPassword
    Example : teradata://jdbc:teradata://127.0.0.1/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${ TeradataAFQ}

  8. DisableSpillEncyption – False

  9. LambdaFunctionName – teradataconnector

  10. SecurityGroupID – Security group ID where Teradata instance is deployed. Please ensure valid Inbound and Outbound rules are applied based on your connection.

  11. SpillPrefix – Create a folder under the S3 bucket which is created in previous steps and specify the name. Example: athena-spill

  12. Subnetids – Use the subnets where Teradata instance is running with comma separation. Ensure the subnet is in VPC and has NAT Gateway and Internet Gateway attached.

  13. Select the I acknowledge check box.

  14. Choose Deploy.

Note: By default, connectors have a 7-day license key. To extend the license key or for help on deploying connectors please reach out to [email protected]

Note: Make sure the Your IAM Role and Lambda Role have permissions to access Serverless Application Repository, CloudFormation, S3, CloudWatch, CloudTrail, Secrets Manager, Lambda, Athena

Running federated queries with Athena


To start running federated queries, complete the following steps:

  1. On the Athena console, choose Workgroups.

  2. If you don’t see a workgroup called AmazonAthenaPreviewFunctionality, create one.

    When this feature becomes generally available, you won’t need to use this workgroup name.

  3. Run your queries, using lambda:teradataconnector to run against tables in Teradata Database.

Running federated queries with Athena

Joining the date set b/w Teradata and Data Lake

Joining the date set b/w SAP HANA and Data Lake

Athena query performance comparison


Several customers have asked us for performance insights and prescriptive guidance on how queries in Athena compare against federated queries, and how to use them.

In this section, we used the COVID19 standard dataset to run results from Athena with Teradata connection running from the on premise. This should give you an idea of what to expect when running federated queries against Teradata Database from the Athena.

The following table summarizes the dataset sizes:

sno

Table

Table Size (Records)

Partition table

Throughput

Time

T1COVID19936KNon partition table1250 records/second12 mins 48 seconds
T2COVID19_RP207KRange partition2617 records/second79 seconds
T3COVID19_CP205KCase partition2848 records/second72 seconds
T4COVID19_RP join with S3 covid dataTD table has 206K records S3 table has 25K recordsTD join with S34900 records/second42 seconds

We ran the following four tests:

  • T1– Non partition table select query ran from Amazon Athena and retrieved records from the Teradata database running on-premises.

  • T2 – Range partition table select query ran from Amazon Athena and retrieved records from the Teradata database running on-premises.

  • T3 – Case partition table select query ran from Amazon Athena and retrieved records from the Teradata database running on on-premises.

  • T3 – Select query ran from Amazon Athena by joining data from the Teradata table and another table from the Amazon S3.

The following graph represents the performance of some of the queries ran from Amazon Athena when connecting Teradata instance running on-premises.

AWS Athena

Key performance best practice considerations


When considering Athena federation with Teradata Database, you could take into account the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to Teradata Database. Use filter and limited-range scans in your queries to avoid full table scans.

  • If your SQL query requires returning a large volume of data from Teradata Database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Teradata to your Amazon S3 data lake.

  • Star schema is a commonly used data model in Terada Database. In the star schema model, unload your large fact tables into your data lake and leave the dimension tables in Teradata Database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your data lake.

  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Teradata Database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Teradata Database cluster to benefit from concurrent connections to queue up.

Conclusion


In this post, you learned how to configure and use Athena federation with Teradata Database using Lambda. Now you don’t need to wait for all the data in your Teradata Database data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best-practice considerations outlined in the post to minimize the data transferred from Teradata Database for better performance. When queries are well-written for federation, the performance penalties are negligible, as observed in the benchmark queries in this post.

×

Book a Demo and
FREE Proof of Value


Contact Trianz to schedule a presentation walkthrough of the Athena AFQ solution and
a free PoV.

By submitting your information, you agree to our revised  Privacy Statement.