Query an SAP HANA database using Athena Federated Query and join with data in your AWS data lake (Amazon S3)

This post provides guidance on how to configure Amazon Athena federation with AWS Lambda to SAP HANA while addressing performance considerations to ensure proper use.

If you use a data lake in Amazon Simple Storage Service (Amazon S3) and SAP HANA as your transactional data store, you may need to join the data sitting in your Amazon S3 data lake with data in an SAP HANA database running on EC2 or on premises. Two common scenarios follow:

  • A team has a data lake in Amazon S3 and uses Athena. They need access to data in an SAP HANA database running either on EC2 or on premises.

  • Analysts using Athena to query their data lake for analytics need the agility and flexibility to access data in an SAP HANA database without moving the data to the Amazon S3 data lake.

In these scenarios, the Athena Federated Query feature allows you to seamlessly access the data in your SAP HANA database without having to unload it to the Amazon S3 data lake first, which removes the overhead of managing such jobs.

In this post, we walk through a step-by-step configuration to set up Athena Federated Query using Lambda to access data in an SAP HANA database running on EC2.

To demonstrate Athena Federated Query, we use the Trianz SAP HANA Athena Federated Query Connector with SAP HANA running on EC2. Your SAP HANA instance can be on EC2 or on premises. You deploy the Trianz SAP HANA Athena Federated Query Connector from the AWS Serverless Application Repository (SAR).

Let’s start with highlighting the solution and then detailing the steps involved.

Solution overview


Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena.


When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

When Athena SQL is combined with Athena Federated Query connectors, an organization can mix and match data from any source. Not only are SQL joins and aggregates possible, you can also perform ETL using a CTAS (CREATE TABLE AS SELECT) query.
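As a sketch of such a CTAS-based ETL step, the following builds a query that materializes data from a federated SAP HANA table into Parquet files on Amazon S3. The schema, table, column, and bucket path names are hypothetical placeholders, not values from a real deployment:

```python
# Sketch of a CTAS query that extracts rows from a federated SAP HANA
# table and writes them to S3 as Parquet via Athena.
# All identifiers (SAPSCHEMA.ORDERS, datalake_db, the S3 path) are
# hypothetical placeholders.
ctas_query = """
CREATE TABLE datalake_db.orders_snapshot
WITH (
    format = 'PARQUET',
    external_location = 's3://athena-accelerator/ctas/orders_snapshot/'
) AS
SELECT order_id, customer_id, order_date, amount
FROM "lambda:trsaphana".SAPSCHEMA.ORDERS
WHERE order_date >= DATE '2021-01-01'
""".strip()

print(ctas_query)
```

You would submit this string to Athena (for example, from the query editor) like any other query; Athena pulls the rows through the connector and writes the results to the external location.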

AWS Lambda lets you run code without provisioning or managing servers. You can run code for virtually any type of application with zero administration, and you pay only for the time your code is running.

SAP HANA is an in-memory, column-oriented, relational database management system. You can connect to SAP HANA from code via its Java Database Connectivity (JDBC) driver.

With Athena Federated Query, Athena routes SAP HANA-specific queries via Lambda to the target SAP HANA instance. Inside the Lambda function, JDBC code makes the connection, fetches the data, and returns it to Athena. If the data does not fit into the Lambda function's memory, the connector spills it to Amazon S3, ensuring reliable transmission. Based on the number of partitions in an SAP HANA table, Athena tries to load the data by running multiple Lambda invocations in parallel.
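Before deploying the connector, it can help to verify that SAP HANA is reachable from inside the VPC. A minimal sketch using SAP's Python driver (hdbcli) follows; the connector itself uses JDBC, and the host, port, user, and password here are hypothetical placeholders:

```python
# Optional connectivity sanity check, run from a machine in the same
# VPC/subnet the Lambda function will use. This only verifies that the
# endpoint and credentials work; the deployed connector uses JDBC.
def check_hana_connectivity(host, port, user, password):
    # Lazy import so the module loads even where hdbcli is absent.
    from hdbcli import dbapi  # pip install hdbcli

    conn = dbapi.connect(address=host, port=port, user=user, password=password)
    try:
        cur = conn.cursor()
        # DUMMY is SAP HANA's built-in one-row table.
        cur.execute("SELECT CURRENT_TIMESTAMP FROM DUMMY")
        row = cur.fetchone()
        cur.close()
        return row[0]
    finally:
        conn.close()
```

If this call succeeds from within the target subnet, the Lambda-based connector should be able to reach the database with the same security group and subnet settings.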

Prerequisites


Before you get started, create a secret for the SAP HANA instance with username and password using AWS Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.

  2. Choose Store a new secret.

  3. Choose the Other types of secrets (e.g., API key) radio button.

  4. Set the credentials as key/value pairs (username, password) for your SAP HANA instance.

  5. For Secret name, enter a name for your secret. Use the prefix SAP_HANA_AFQ so it’s easy to find.

  6. Leave the remaining fields at their defaults and choose Next.

  7. Complete your secret creation.
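The steps above store the credentials as a JSON key/value secret. A minimal sketch of the same thing with boto3 follows; the secret name and credential values are placeholders, and the create_secret call is shown as a comment rather than executed:

```python
import json

# Secrets Manager stores the key/value pairs as a single JSON string.
# The name uses the SAP_HANA_AFQ prefix recommended above; the values
# are hypothetical placeholders.
secret_name = "SAP_HANA_AFQ-trianz-saphana-athena-jdbc"
secret_value = json.dumps({"username": "SYSTEM", "password": "example-password"})

# With boto3 (not executed in this sketch), the equivalent call is:
#   import boto3
#   boto3.client("secretsmanager").create_secret(
#       Name=secret_name, SecretString=secret_value
#   )
print(secret_name, secret_value)
```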


Setting up your S3 bucket


On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use the bucket athena-accelerator with the subfolder saphana (athena-accelerator/saphana).


Configuring Athena federation with SAP HANA instance


To configure Athena federation with your SAP HANA instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.

  2. In the search field, enter TrianzSAPHANAAthenaJDBC.

  3. In the Application settings section, provide the following details:

  4. Application name – TrianzSAPHANAAthenaJDBC

  5. SecretNamePrefix – trianz-saphana-athena-jdbc

  6. SpillBucket – athena-accelerator (the bucket created in the previous step)

  7. JDBCConnectorConfig –
    Format : saphana://jdbc:sap://{saphana_instance_url}/?${secretname}
    Example : saphana://jdbc:sap://54.86.14.206:39015/?${trianz-saphana-athena-jdbc}

  8. DisableSpillEncryption – False

  9. LambdaFunctionName – trsaphana

  10. SecurityGroupID – ID of the security group where the SAP HANA instance is deployed. Ensure valid inbound and outbound rules are applied based on your connection.

  11. SpillPrefix – Create a folder under the S3 bucket created in the previous steps and specify its name (for example, athena-spill).

  12. SubnetIds – Comma-separated IDs of the subnets where the SAP HANA instance is running. Ensure the subnets are in a VPC with a NAT gateway and an internet gateway attached.

  13. Select the I acknowledge check box.

  14. Choose Deploy.
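The JDBCConnectorConfig value from step 7 follows a fixed pattern, which can be sketched as a small helper. The endpoint and secret name below are the example values from this post:

```python
# Builds the JDBCConnectorConfig value described in step 7:
#   saphana://jdbc:sap://{saphana_instance_url}/?${secretname}
# where {saphana_instance_url} is host:port and ${secretname} references
# the Secrets Manager secret holding the username and password.
def build_jdbc_connector_config(instance_url: str, secret_name: str) -> str:
    return f"saphana://jdbc:sap://{instance_url}/?${{{secret_name}}}"

# Example values from this post.
config = build_jdbc_connector_config("54.86.14.206:39015", "trianz-saphana-athena-jdbc")
print(config)
# saphana://jdbc:sap://54.86.14.206:39015/?${trianz-saphana-athena-jdbc}
```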

Note: By default, connectors have a 7-day license key. To extend the license key or for help deploying connectors, reach out to [email protected].

Note: Make sure your IAM role and the Lambda execution role have permissions to access the Serverless Application Repository, AWS CloudFormation, Amazon S3, CloudWatch, CloudTrail, Secrets Manager, Lambda, and Athena.

Running federated queries with Athena


To start running federated queries, complete the following steps:

  1. On the Athena console, choose a workgroup with Athena engine version 2. Athena Federated Query is supported only on Athena engine version 2.

  2. You can find the current engine version for any workgroup on the Workgroups page.

  3. Run your queries, using lambda:trsaphana to run against tables in the SAP HANA database.


Joining the dataset between SAP HANA and the data lake

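Such a join can be sketched as a single Athena query that reads one side through the lambda:trsaphana connector deployed above and the other side from the S3 data lake via the default AwsDataCatalog. The schema, database, table, and column names here are hypothetical placeholders:

```python
# Sketch of an Athena query joining a federated SAP HANA table with a
# table in the S3 data lake (Glue Data Catalog). Schema, database,
# table, and column names are hypothetical placeholders.
join_query = """
SELECT h."LOCATION", h."CONFIRMED", s.population
FROM "lambda:trsaphana".COVIDSCHEMA.COVID19_BULKDATA h
JOIN AwsDataCatalog.datalake_db.covid19_2020 s
  ON h."LOCATION" = s.location
WHERE h."DATE" >= DATE '2020-03-01'
""".strip()

print(join_query)
```

Athena pushes the date predicate down to SAP HANA through the connector, so only the matching SAP HANA rows cross Lambda before being joined with the S3 data.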

Athena query performance comparison


Several customers have asked us for performance insights and prescriptive guidance on how queries in Athena compare against federated queries, and how to use them.

In this section, we use the COVID-19 standard dataset to run queries from Athena against an SAP HANA instance installed on EC2. This should give you an idea of what to expect when running federated queries against an SAP HANA database from Athena.

The following table summarizes the dataset sizes:

| sno | Table | Table Size (Records) | Partitioning | Application | Total Time | Throughput |
|-----|-------|----------------------|--------------|-------------|------------|------------|
| T1 | covid19_bulkdata | 936586 | Non-partitioned table | SAP HANA | 24.56 seconds | 38138 records/second |
| T2 | COVID19_HASHHASHPARTITION_BULK | 936586 | Hash-hash partition | SAP HANA | 18.24 seconds | 51347 records/second |
| T3 | COVID19_RANGERANGEPARTITION_BULK | 936586 | Range-range partition | SAP HANA | 24.43 seconds | 38337 records/second |
| T4 | COVID19_HASHHASHPARTITION_BULK (SAP HANA), covid19_2020 (S3) | 936586 | Range-range partition | S3/SAP HANA | 13.88 seconds | 67477 records/second |
| T5 | COVID19_RANGERANGEPARTITION_BULK (SAP HANA), covid19_2020 (S3) | 936586 | Range-range partition | S3/SAP HANA | 12.01 seconds | 77983 records/second |
| T6 | VIEWCOVID119_BULKDATA | 936586 | View with pagination-based partition, 300000 records per partition | SAP HANA | 19.12 seconds | 48984 records/second |
| T7 | COVID19_RANGERANGEPARTITION_BULK (SAP HANA), VIEWCOVID119_BULKDATA, covid19_2020 (S3) [UNION ALL query] | 936586 | Partition | SAP HANA table/view/S3 | 12.48 seconds | 75046 records/second |

We executed the following tests:

  • T1, T2, and T3 – Queries ran from AFQ and retrieved records from the SAP HANA instance running on EC2.

  • T4, T5, T6, and T7 – Queries ran from AFQ and retrieved records from the SAP HANA instance running on EC2 and from Amazon S3.

The following graph represents the performance of some of the queries when run against the SAP HANA instance on EC2 through Amazon Athena.


Key performance best practice considerations


When considering Athena federation with an SAP HANA database, take into account the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to SAP HANA Database. Use filter and limited-range scans in your queries to avoid full table scans.

  • If your SQL query requires returning a large volume of data from SAP HANA Database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from SAP HANA to your Amazon S3 data lake.

  • Star schema is a commonly used data model in SAP HANA Database. In the star schema model, unload your large fact tables into your data lake and leave the dimension tables in SAP HANA Database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your data lake.

  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It's important to monitor SAP HANA workload management to ensure there is no queuing. You can also configure workload classes on your SAP HANA database to manage concurrent connections.
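To illustrate the first best practice, the following contrasts a federation-friendly filtered query with a full-table scan. Schema, table, and column names are hypothetical placeholders:

```python
# Sketch of a federation-friendly query: the WHERE predicate is pushed
# down to SAP HANA by the connector, so only matching rows travel
# through Lambda back to Athena.
# Schema, table, and column names are hypothetical placeholders.
pushdown_query = """
SELECT "ORDER_ID", "AMOUNT"
FROM "lambda:trsaphana".SAPSCHEMA.ORDERS
WHERE "ORDER_DATE" BETWEEN DATE '2021-01-01' AND DATE '2021-01-31'
""".strip()

# By contrast, an unfiltered query such as:
#   SELECT * FROM "lambda:trsaphana".SAPSCHEMA.ORDERS
# forces a full table scan and moves the entire table through Lambda,
# risking timeouts and slow performance.
print(pushdown_query)
```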

Conclusion


In this post, you learned how to configure and use Athena Federated Query with an SAP HANA database using Lambda. Now you don't need to wait for all the data in your SAP HANA database to be unloaded to Amazon S3 and maintained on a day-to-day basis before you can run your queries.

You can use the best-practice considerations outlined in the post to minimize the data transferred from SAP HANA Database for better performance. When queries are well-written for federation, the performance penalties are negligible, as observed in the benchmark queries in this post.
