SQL query engine architecture was designed to allow users to query a variety of data sources within a single query. While early SQL-based query engines such as Apache Hive allowed analysts to cut through the clutter of analytical data, they found running SQL analytics on multi-petabyte data warehouses to be a time-intensive process that was difficult to visualize and hard to scale.
With the introduction of powerful interactive query engines like Athena Federated Query, users can quickly query multiple sources of data within a serverless, easy-to-use fully managed SQL interface. This process eliminates coding complexity by allowing users to perform cross-data source analytics using familiar SQL constructs for quick analysis, or by using scheduled SQL queries to extract and store results in Amazon S3 for subsequent analysis.
With a federated query, what is being “federated” or brought together are the sources of information that can answer the query. For instance, think of the advantage of being able to look in all the rooms in your house at once for the car keys that have gone missing? Now imagine being able to instantaneously look in every book in a library for an answer — no matter what language the book was written in.
This is akin to how Athena Federated Query connectors have improved database queries and the delivery of business intelligence insights. The connectors provide the ability to connect and query databases across multiple on-prem and public cloud environments at a much faster rate than traditional database management systems (DBMS).
What makes Amazon Athena Federated Query unique is that the tool allows non-technical users to scan multiple sources of data with simple SQL commands. Users can also begin analyzing and reporting on their structured and unstructured data within a single query using the built-in query editor.
In addition, with Athena’s Federated Query connectors, users no longer need to deal with setting up servers, frameworks, or clusters. Instead, users simply select the stored data in their data lake (Amazon S3), define the query, and the AFQ connectors automatically retrieve and aggregate data from specified repositories.
Athena Federated Query Connectors work by using data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between the target data source and Athena.
An SQL connector can be thought of as an extension of Athena's query engine. Prebuilt Athena data source connectors allow Amazon data sources like Amazon DynamoDB and JDBC-compliant relational data sources such as MySQL and PostgreSQL to work in unison.
Users can also employ the Athena Query Federation SDK to write custom connectors. This allows them to choose, configure, and deploy a wide range of data source connectors to their account.
After the user deploys the data source connectors, the connector is associated with a catalog that the user can specify in SQL queries. From there, they can combine SQL statements from several catalogs, spanning multiple data sources with a single query.
When the query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the specific tables that need to be read. Athena then manages data parallelism and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements to strengthen user-level security.
The Connectors use Apache Arrow as the format for returning data requested in a query. This enables connectors to be implemented in different programming languages and data formats, such as CSV, TSV, and JSON. It also supports open-source columnar formats and compressed data formats such as Zlib, LZO, and gzip.
Since connectors are processed in an event-driven, serverless computing platform (Lambda), they can be used to access data from any data source that is accessible from Lambda in the cloud or on-prem infrastructure.
The result of using Athena Query Federation is that users no longer need multiple languages to build complex pipelines to extract, transform, and load into a data warehouse before users can query the data.
Imagine an eCommerce store has noticed a spike in customers reporting their orders are not being processed. With key operational metrics and logs residing in multiple places, it would be next to impossible to quickly assess the scope of this problem — or even begin to do root cause analysis before having to refund the order.
By using Athena, the business can use federated queries to quickly find relevant details — even if the analyst is not experienced with each of the individual systems that make up their data architecture. Athena allows the user to run a query and learn how many customer orders are problematic. They can then observe commonalities between the orders to locate the root cause or any relevant information that helps them to investigate further.
In addition, the business can extract all the various warnings and errors from the specified connectors and AWS resources. By running a query search across multiple silos and data stacks, Athena can pull all the relevant criteria to determine what orders need to be processed.
After creating the tables in a matter of seconds, that information can then be easily viewed and used to dive deeper to troubleshoot the problem. What may have taken days without the use of a federated query now takes far less time and technical wrangling.
To learn more about how Athena Federated Query Connectors work under the hood, check out the following video:
As with any data analytics technology, the answer to the SQL query engine question depends on the size and scope of data that needs to be visualized. Some organizations have optimized their database management systems so that SQL query engines are not necessary to pull data across multiple sources. Other organizations have overwhelmed their data warehouses and are seeking new ways to simplify BI and facilitate cross data-source analytics.
This is where the scalability of SQL query engines become a viable alternative to traditional DBMSs. However, it is important to note that while database technologies have been around for decades, SQL query engines are a relatively new technology. If the query engine is not implemented correctly, it may disrupt workloads and lead to suboptimal cost reduction.
While query engines are a great way to visualize analytics for large sets of data, new users should consult an analytics professional who can help them better understand their use cases and business requirements before implementing a big data analytics solution.
If you are interested in learning more about an analytics solution that does not require a complex technology ecosystem, Trianz would be happy to demonstrate various querying and reporting solutions that can improve upon your current development time, accuracy, and performance.
Application Modernization at Speed and Scale Enterprises are pursuing greater application scalability, cost efficiency, and standardization with containerization and virtualization platforms. So, what’s the difference? Containers are a type of virtualization technology that allows users to run multiple operating systems inside a single instance of an OS. They are lightweight and portable, making them ideal for running applications across different platforms.Explore
Container Orchestration or Compute Service? Amazon Web Services (AWS) offers a range of cloud computing services to meet enterprise needs. Included in its service offering is the elastic compute service (ECS) and elastic compute cloud (EC2). Choosing between these two services can be difficult, as one focuses on virtualization while the other manages containerization. In the following article, we will explore the differences between Amazon ECS and EC2 to help you better understand which service is right for your use case.Explore
What is Application Modernization? Application modernization is the process of converting, rewriting, or porting legacy software packages to operate more efficiently with a modern infrastructure. This can involve migrating to the cloud, creating apps with a serverless architecture, containerizing services, or overhauling data pipelines using a modern DevOps model.Explore
What are the Differences? Though often used interchangeably, data pipelines and ETL are two different methodologies for managing and structuring data. ETL tools are used for data extraction, transformation, and loading. Whereas data pipelines encompass the entire set of processes applied to data as it moves from one system to another. Sometimes data pipelines involve transformation, and sometimes they do not.Explore