Configure and optimize performance of Amazon Athena federation with Trianz Oracle AFQ

Does a tool exist that can simplify data virtualization yet is inexpensive to maintain? After all, data virtualization is key to simplifying data analytics and digital transformation. But up until recently, not only has there been a shortage of inexpensive data virtualization platforms, but the existing platforms have also been incredibly complex.

That all seems to be changing. AWS Athena with its Presto engine has gotten very wide acceptability as a mature data lake solution on top of Amazon Simple Storage Service. With Athena, organizations can also manage multiple on-premises and on-cloud data sources. Now, the AWS Athena Query Federation can simplify connecting various data sources and allow Athena to be used for data virtualization.

When Athena SQL is combined with AFQ Connectors, an organization can mix and match data from any source. Not only are SQL joins or aggregates possible, but with CTAS (Create Table as Select) tactical data migration from remote data sources to Athena and S3 can also be achieved. This is very advantageous considering that SQL can be used for not only data mining but for data migration too.

As SQL is involved, it is simpler. And because the data is on S3, it is faster.

For the purpose of this blog and demonstrating Athena AFQ Federation, we will use Trianz Oracle AFQ Connector. The runtime will include Oracle XE running on EC2 and RDS. Your Oracle instance can be on RDS, EC2 or on-premises.

Imagine a situation where Oracle (Exadata) is your data warehouse, and Athena is your data lake. When you need to unify data from the Oracle warehouse and the Athena data lake, you would deploy the Trianz Oracle AFQ Connector available in the AWS Serverless Application Repository (SAR). There are other connectors available for Teradata, Sap Hana, Snowflake, Big Query, Cloudera, and others as well.

Therefore, depending upon your requirements, multiple connectors can be configured. The deployment details are more or less like the Oracle Connector described below. You can always reach out to [email protected] for any queries and feedback.

Let’s start with highlighting the solution and then detailing the steps involved.

Solution overview


Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

AWS Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.

AWS Lambda lets you run code without provisioning or managing servers. You can run code for virtually any type of application with zero administration and only pay for when the code is running.

Oracle is a widely recognized database and exists as a relational database management system (RDBMS) and as a data warehouse (Exadata). Oracle can be connected using code via its Java Database Connectivity (JDBC) drivers.

Solution overview

In AFQ Federation, Athena will route Oracle specific queries via Lambda to the target Oracle instance. Inside the Lambda, JDBC code will make connection and fetch data and route it back to Athena. If the data does not fit into Lambda RAM runtime memory, it will spill the data to S3 ensuring reliable transmission. Based on the number of partitions in an Oracle Table, Athena will try to load the data by running multiple Lambda in parallel.

The Trianz Oracle AFQ Connector (and other connectors from Trianz) ensure simple, reliable, and fast transmission for data – two to three times faster than other data virtualization platforms – for joining or aggregating within Athena.

Prerequisites


Before you get started, create a secret for the Oracle instance with username and password using AWS Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.

  2. Choose Store a new secret.

  3. Choose the Other types of secrets (e.g., API key) radio button

  4. Set the credentials as key/value pairs (username, password) for your Oracle XE instance.

  5. For Secret name, enter a name for your secret. Use the prefix OracleAFQ so it’s easy to find.

  6. Leave the remaining fields at their defaults and choose Next.

  7. Complete your secret creation.

Prerequisites

Setting up your S3 bucket


On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, use the name athena-accelerator/oracle.

Setting up your S3 bucket

Configuring Athena federation with Oracle XE instance


To configure Athena federation with Oracle instance, complete the following steps:

  1. On the AWS Serverless Application Repository, choose Available applications.

  2. In the search field, enter TrianzOracleAthenaJDBC.

    Configuring Athena federation with Oracle
  3. In the Application settings section, provide the following details:

  4. Application name - TrianzOracleAthenaJDBC

  5. SecretNamePrefix – OracleAFQ_XE

  6. SpillBucket – Athena-accelerator/oracle

  7. JDBCConnectorConfig –
    Format : oracle://jdbc:oracle:thin:${secretname}@//hostname:port/servicename
    Example : oracle://jdbc:oracle:thin:${OracleAFQ_XE}@//12.345.67.89:1521/xe

  8. DisableSpillEncyption – False

  9. LambdaFunctionName – oracleconnector

  10. SecurityGroupID – Security group ID where Oracle instance is deployed. Please ensure valid Inbound and Outbound rules are applied based on your connection.

  11. SpillPrefix – Create a folder under the S3 bucket which is created in previous steps and specify the name. Example: athena-spill

  12. Subnetids – Use the subnets where Oracle instance is running with comma separation. Ensure the subnet is in VPC and has NAT Gateway and Internet Gateway attached.

  13. Select the I acknowledge check box.

  14. Choose Deploy.

Note: By default, connectors have a 7-day license key. To extend the license key or for help on deploying connectors please reach out to [email protected]

Note: Make sure the Your IAM Role and Lambda Role have permissions to access Serverless Application Repository, CloudFormation, S3, CloudWatch, CloudTrail, Secrets Manager, Lambda, Athena

Running federated queries with Athena


To start running federated queries, complete the following steps:

  1. On the Athena console, choose Workgroups.

  2. If you don’t see a workgroup called AmazonAthenaPreviewFunctionality, create one.

    When this feature becomes generally available, you won’t need to use this workgroup name.

  3. Run your queries, using lambda:oracleconnector to run against tables in Oracle Database.

Running federated queries with Athena

Athena query performance comparison


Several customers have asked us for performance insights and prescriptive guidance on how queries in Athena compare against federated queries, and how to use them.

In this section, we used the COVID19 standard dataset to run results from Athena with Oracle connection using AWS RDS and Oracle from the EC2 instance. This should give you an idea of what to expect when running federated queries against Oracle Database from the Athena.

The following table summarizes the dataset sizes:

sno

Table

Table Size (Records)

Partition table

Application

Time

T1COVID19_DATA210KNon partition tableOracle 11g installed in EC21420 records/second
T2COVID19_DATA215KNon partition tableRDS from AWS2400 records/second
T3COVID19_RANGE_PARTITION215KRange partitionRDS from AWS2800 records/second
T4COVID19_DATA_LIST_PARTITION215KList partitionRDS from AWS2650 records/second

We ran the following four tests:

  • T1 – Queries ran from AWS EC2 which will connect AWS Athena with JDBC and retrieved records from the Oracle 11g installed in other AWS EC2.

  • T2 – Queries ran from AWS EC2 which will connect AWS Athena with JDBC and retrieved records from the Oracle db in AWS RDS.

  • T3 – Queries ran from AWS EC2 which will connect AWS Athena with JDBC and retrieved records from the Oracle db in AWS RDS.

  • T4 – Queries ran from AWS EC2 which will connect AWS Athena with JDBC and retrieved records from the Oracle db in AWS RDS.

The following graph represents the performance of some of the queries when run from the EC2 instance by connecting AWS Athena.

AWS Athena

Key performance best practice considerations


When considering Athena federation with Oracle Database, you could take into account the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to Oracle Database. Use filter and limited-range scans in your queries to avoid full table scans.

  • If your SQL query requires returning a large volume of data from Oracle Database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Oracle to your Amazon S3 data lake.

  • Star schema is a commonly used data model in Oracle Database. In the star schema model, unload your large fact tables into your data lake and leave the dimension tables in Oracle Database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your data lake.

  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Oracle Database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Oracle Database cluster to benefit from concurrent connections to queue up.

Conclusion


In this post, you learned how to configure and use Athena federation with Oracle Database using Lambda. Now you don’t need to wait for all the data in your Oracle Database data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best-practice considerations outlined in the post to minimize the data transferred from Oracle Database for better performance. When queries are well-written for federation, the performance penalties are negligible, as observed in the TPC-DS benchmark queries in this post.

Happy query federating!

×

Book a Demo and
FREE Proof of Value


Contact Trianz to schedule a presentation walkthrough of the Athena AFQ solution and
a free PoV.

By submitting your information, you agree to our revised  Privacy Statement.