This post provides guidance on configuring Amazon Athena federation with AWS Lambda to access SAP HANA, along with performance considerations to ensure proper use.
You may benefit from Athena federation in scenarios like the following:

A team maintains a data lake in Amazon Simple Storage Service (Amazon S3), uses SAP HANA as its transactional data store, and needs to join data sitting in the Amazon S3 data lake with data in an SAP HANA database running on Amazon EC2 or on premises.

Analysts using Athena to query their data lake for analytics need the agility and flexibility to access data in an SAP HANA database without moving the data to the Amazon S3 data lake.
In these scenarios, the Athena federation feature allows you to seamlessly access the data in your SAP HANA database without having to wait for the data to be unloaded to the Amazon S3 data lake, which removes the overhead of managing such jobs.
In this post, we walk through a step-by-step configuration to set up Athena federation using Lambda to access data in an SAP HANA database running on EC2.
To demonstrate Athena Federated Query in this post, we use the Trianz SAP HANA Athena Federated Query Connector. The runtime includes SAP HANA running on EC2; your SAP HANA instance can be on EC2 or on premises. You deploy the Trianz SAP HANA Athena Federated Query Connector from the AWS Serverless Application Repository (SAR).
Let’s start by highlighting the solution and then detail the steps involved.
Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.
Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena.
When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.
When Athena SQL is combined with Athena federated query connectors, an organization can mix and match data from any source. Not only are SQL joins and aggregates possible, you can also perform ETL using a CTAS (Create Table As Select) query.
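As a sketch of the CTAS pattern, the snippet below builds the text of a query that joins a federated SAP HANA table with an S3-backed table and writes the result to Amazon S3 as Parquet. The schema, table, and column names are hypothetical; `lambda:trsaphana` refers to the connector Lambda function deployed later in this post.

```python
# Illustrative only: the catalog, schema, table, and column names below are
# assumptions for the sketch, not part of the Trianz connector itself.
ctas_query = """
CREATE TABLE datalake.orders_enriched
WITH (external_location = 's3://athena-accelerator/saphana/ctas/', format = 'PARQUET') AS
SELECT h."ORDER_ID", h."AMOUNT", r.region_name
FROM "lambda:trsaphana"."SCHEMA"."ORDERS" h
JOIN datalake.regions r ON h."REGION_ID" = r.region_id
"""
print(ctas_query.strip())
```

You would submit this query text through the Athena console or API like any other Athena SQL statement.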
AWS Lambda lets you run code without provisioning or managing servers. You can run code for virtually any type of application with zero administration and pay only for the time your code is running.
SAP HANA is an in-memory, column-oriented, relational database management system. You can connect to SAP HANA programmatically via its Java Database Connectivity (JDBC) driver.
With Athena federation, Athena routes SAP HANA-specific queries via Lambda to the target SAP HANA instance. Inside the Lambda function, JDBC code makes the connection, fetches the data, and routes it back to Athena. If the data does not fit into the Lambda function's runtime memory, it is spilled to Amazon S3 to ensure reliable transmission. Based on the number of partitions in an SAP HANA table, Athena tries to load the data by running multiple Lambda functions in parallel.
Before you get started, create a secret for the SAP HANA instance with username and password using AWS Secrets Manager.
On the Secrets Manager console, choose Secrets.
Choose Store a new secret.
Choose the Other types of secrets (e.g., API key) radio button.
Set the credentials as key/value pairs (username, password) for your SAP HANA instance.
For Secret name, enter a name for your secret. Use the prefix SAP_HANA_AFQ so it’s easy to find.
Leave the remaining fields at their defaults and choose Next.
Complete your secret creation.
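The console steps above store the credentials as a JSON document of key/value pairs. The sketch below shows the shape of that payload; the secret name and credential values are hypothetical placeholders, and with the AWS SDK you would pass the payload as the SecretString when creating the secret.

```python
import json

# Hypothetical values for illustration only; supply your real SAP HANA
# credentials through Secrets Manager, never in code.
secret_name = "SAP_HANA_AFQ-demo"  # the SAP_HANA_AFQ prefix makes it easy to find
secret_payload = json.dumps({"username": "athena_user", "password": "example-password"})

print(secret_name, secret_payload)
```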
On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, use the name athena-accelerator/saphana.
On the AWS Serverless Application Repository, choose Available applications.
In the search field, enter TrianzSAPHANAAthenaJDBC.
In the Application settings section, provide the following details:
Application name - TrianzSAPHANAAthenaJDBC
SecretNamePrefix – trianz-saphana-athena-jdbc
SpillBucket – athena-accelerator/saphana
JDBCConnectorConfig – Use the format saphana://jdbc:sap://{saphana_instance_url}/?${secretname}. Example: saphana://jdbc:sap://54.86.14.206:39015/?${trianz-saphana-athena-jdbc}
DisableSpillEncryption – False
LambdaFunctionName – trsaphana
SecurityGroupID – The ID of the security group where the SAP HANA instance is deployed. Ensure valid inbound and outbound rules are applied based on your connection.
SpillPrefix – Create a folder under the S3 bucket created in the previous step and specify its name, for example athena-spill.
Subnetids – The subnets where the SAP HANA instance is running, comma separated. Ensure the subnets are in a VPC that has a NAT gateway and an internet gateway attached.
Select the I acknowledge check box.
Choose Deploy.
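The JDBCConnectorConfig value can be assembled mechanically from the instance endpoint and the secret name. The helper below is a sketch of that format; it reproduces the example value from the application settings above.

```python
def jdbc_connector_config(instance_url: str, secret_name: str) -> str:
    """Build the JDBCConnectorConfig string in the format the connector expects:
    saphana://jdbc:sap://{saphana_instance_url}/?${secretname}
    """
    return f"saphana://jdbc:sap://{instance_url}/?${{{secret_name}}}"

# Values taken from the example in the application settings above.
config = jdbc_connector_config("54.86.14.206:39015", "trianz-saphana-athena-jdbc")
print(config)
```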
Note: By default, connectors have a 7-day license key. To extend the license key or for help deploying connectors, reach out to [email protected]
Note: Make sure your IAM role and the Lambda role have permissions to access the Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena.
On the Athena console, choose a workgroup with Athena engine version 2. Athena Federated Query is supported only on Athena engine version 2. You can find the current engine version for any workgroup on the Workgroups page.
Run your queries using lambda:trsaphana to query tables in the SAP HANA database.
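As an illustration, the query text below joins a SAP HANA table (reached through the trsaphana Lambda connector, referenced with the "lambda:trsaphana" catalog prefix) with an S3-backed table in the default Athena catalog. The schema and column names are assumptions for the sketch.

```python
# Illustrative federated query text; the "lambda:trsaphana" prefix tells
# Athena to route that part of the query through the deployed connector.
join_query = """
SELECT s3t.state, SUM(hana."CONFIRMED") AS confirmed
FROM "lambda:trsaphana"."SCHEMA"."COVID19_HASHHASHPARTITION_BULK" hana
JOIN covid19_2020 s3t ON hana."FIPS" = s3t.fips
GROUP BY s3t.state
"""
print(join_query.strip())
```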
Several customers have asked us for performance insights and prescriptive guidance on how queries in Athena compare against federated queries, and how to use them.
In this section, we used the standard COVID-19 dataset to compare results from Athena querying an SAP HANA instance installed on EC2. This should give you an idea of what to expect when running federated queries against an SAP HANA database from Athena.
| S.No | Table | Table size (records) | Partitioning | Application | Total time | Throughput |
|------|-------|----------------------|--------------|-------------|------------|------------|
| T1 | covid19_bulkdata | 936,586 | Non-partitioned table | SAP HANA | 24.56 seconds | 38,138 records/second |
| T2 | COVID19_HASHHASHPARTITION_BULK | 936,586 | Hash-hash partition | SAP HANA | 18.24 seconds | 51,347 records/second |
| T3 | COVID19_RANGERANGEPARTITION_BULK | 936,586 | Range-range partition | SAP HANA | 24.43 seconds | 38,337 records/second |
| T4 | COVID19_HASHHASHPARTITION_BULK (SAP HANA), covid19_2020 (S3) | 936,586 | Range-range partition | S3/SAP HANA | 13.88 seconds | 67,477 records/second |
| T5 | COVID19_RANGERANGEPARTITION_BULK (SAP HANA), covid19_2020 (S3) | 936,586 | Range-range partition | S3/SAP HANA | 12.01 seconds | 77,983 records/second |
| T6 | VIEWCOVID119_BULKDATA | 936,586 | View with pagination-based partitioning, 300,000 records per partition | SAP HANA | 19.12 seconds | 48,984 records/second |
| T7 | COVID19_RANGERANGEPARTITION_BULK (SAP HANA), VIEWCOVID119_BULKDATA, covid19_2020 (S3) [union all query] | 936,586 | Partition | SAP HANA table/view/S3 | 12.48 seconds | 75,046 records/second |
T1–T3 – Queries ran from AFQ and retrieved records from the SAP HANA instance running on EC2.
T4–T7 – Queries ran from AFQ and retrieved records from the SAP HANA instance running on EC2 and from Amazon S3.
The following graph represents the performance of some of the queries when run through Amazon Athena against the SAP HANA instance on EC2.
Athena federation works great for queries with predicate filtering because the predicates are pushed down to SAP HANA Database. Use filter and limited-range scans in your queries to avoid full table scans.
If your SQL query requires returning a large volume of data from SAP HANA Database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from SAP HANA to your Amazon S3 data lake.
Star schema is a commonly used data model in SAP HANA Database. In the star schema model, unload your large fact tables into your data lake and leave the dimension tables in SAP HANA Database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your data lake.
When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It's important to monitor your SAP HANA workload management settings to ensure there is no excessive queuing, and to size your SAP HANA instance to handle the resulting concurrent connections.
In this post, you learned how to configure and use Athena federation with SAP HANA Database using Lambda. Now you don’t need to wait for all the data in your SAP HANA Database to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.
You can use the best-practice considerations outlined in the post to minimize the data transferred from SAP HANA Database for better performance. When queries are well-written for federation, the performance penalties are negligible, as observed in the benchmark queries in this post.