The SQL query engine architecture was designed to let users query a variety of data sources within a single query. Early SQL-based engines such as Apache Hive helped analysts cut through the clutter of analytical data, but running SQL analytics on multi-petabyte data warehouses remained a time-intensive process that was difficult to visualize and hard to scale.
With the introduction of powerful interactive query engines like Athena Federated Query, users can quickly query multiple sources of data through a serverless, easy-to-use, fully managed SQL interface. This eliminates coding complexity: users can perform cross-data-source analytics with familiar SQL constructs for quick analysis, or schedule SQL queries that extract and store results in Amazon S3 for subsequent analysis.
With a federated query, what is being “federated,” or brought together, are the sources of information that can answer the query. Think of the advantage of being able to look in every room of your house at once for a set of car keys that has gone missing. Now imagine being able to instantaneously look in every book in a library for an answer, no matter what language the book was written in.
This is akin to how Athena Federated Query connectors have improved database queries and the delivery of business intelligence insights. The connectors make it possible to connect to and query databases across multiple on-premises and public cloud environments at a much faster rate than traditional database management systems (DBMSs).
What makes Amazon Athena Federated Query unique is that the tool allows non-technical users to scan multiple sources of data with simple SQL commands. Users can also begin analyzing and reporting on their structured and unstructured data within a single query using the built-in query editor.
In addition, with Athena Federated Query connectors, users no longer need to deal with setting up servers, frameworks, or clusters. Instead, they simply select the data stored in their data lake (Amazon S3) and other registered sources, define the query, and the connectors automatically retrieve and aggregate data from the specified repositories.
Athena Federated Query works by using data source connectors that run on AWS Lambda to execute federated queries. A data source connector is a piece of code that translates between the target data source and Athena.
A data source connector can be thought of as an extension of Athena's query engine. Prebuilt Athena data source connectors allow the engine to query AWS data sources like Amazon DynamoDB as well as JDBC-compliant relational databases such as MySQL and PostgreSQL.
Users can also employ the Athena Query Federation SDK to write custom connectors. This allows them to choose, configure, and deploy a wide range of data source connectors to their account.
After the user deploys a data source connector, it is associated with a catalog that the user can specify in SQL queries. From there, a single SQL statement can reference several catalogs, spanning multiple data sources in one query.
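To illustrate, a cross-catalog join might look like the sketch below. The catalog, schema, and table names (`mysql_cat`, `ddb_cat`, `sales.orders`, and so on) are hypothetical placeholders for connectors a user has deployed, not real resources:

```sql
-- Hypothetical catalog/schema/table names for two deployed connectors:
-- "mysql_cat" (MySQL connector) and "ddb_cat" (DynamoDB connector).
SELECT o.order_id,
       o.total,
       c.email
FROM   mysql_cat.sales.orders AS o
JOIN   ddb_cat.default.customers AS c
       ON o.customer_id = c.customer_id
WHERE  o.order_date >= DATE '2024-01-01';
```

Each three-part name resolves to catalog.schema.table, so one statement can draw rows from a relational database and a NoSQL store at the same time.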
When a query is submitted against a data source, Athena invokes the corresponding connector to identify the parts of the tables that need to be read. Athena then manages data parallelism and pushes down filter predicates. Based on the user submitting the query, connectors can also grant or restrict access to specific data elements, strengthening user-level security.
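Predicate pushdown can be pictured with a hedged example like the following (the catalog and column names are again hypothetical): the filter is handed to the connector, so the source system discards non-matching rows before anything crosses the wire.

```sql
-- "mysql_cat" is a hypothetical catalog bound to a deployed connector.
-- Athena can push the WHERE predicate down into the connector's scan,
-- so only rows with region = 'eu-west-1' are returned to the engine.
SELECT order_id, total
FROM   mysql_cat.sales.orders
WHERE  region = 'eu-west-1';
```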
Connectors use Apache Arrow as the format for returning the data requested in a query. Because Arrow is language-independent, connectors can be implemented in different programming languages, and they can read a range of data formats such as CSV, TSV, and JSON, along with open-source columnar formats and compressed data formats such as Zlib, LZO, and gzip.
Since connectors run on Lambda, an event-driven serverless compute platform, they can be used to access data from any source that is reachable from Lambda, whether in the cloud or in on-prem infrastructure.
The result of using Athena Query Federation is that users no longer need multiple languages and complex extract, transform, and load (ETL) pipelines to move data into a warehouse before they can query it.
Imagine an eCommerce store has noticed a spike in customers reporting that their orders are not being processed. With key operational metrics and logs residing in multiple places, it would be next to impossible to quickly assess the scope of the problem, or even begin root cause analysis, before having to refund the orders.
By using Athena federated queries, the business can quickly find relevant details, even if the analyst is not experienced with each of the individual systems that make up the data architecture. A single query can reveal how many customer orders are problematic, and the analyst can then look for commonalities between those orders to locate the root cause or any relevant information that helps them investigate further.
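As a rough sketch of such an investigation (every catalog, schema, table, and column name below is a hypothetical placeholder for the store's own deployed connectors), a single query might join failed order records from a relational database with application error logs:

```sql
-- Hypothetical catalogs: "rds_cat" (MySQL connector over the orders DB)
-- and "cw_cat" (Amazon CloudWatch Logs connector). Names are illustrative.
SELECT o.order_id,
       o.customer_id,
       l.message
FROM   rds_cat.shop.orders AS o
JOIN   cw_cat."/ecs/checkout".all_log_streams AS l
       ON l.message LIKE '%' || CAST(o.order_id AS VARCHAR) || '%'
WHERE  o.status = 'FAILED';
```

One statement like this surfaces both which orders failed and the log lines that mention them, without the analyst logging into the database and the log system separately.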
In addition, the business can extract all the warnings and errors surfaced by the specified connectors and AWS resources. By running one search across multiple silos and data stacks, Athena can pull together all the relevant records needed to determine which orders still need to be processed.
Because the tables are created in a matter of seconds, that information can be viewed right away and used to dive deeper into the problem. What might have taken days without a federated query now takes far less time and technical wrangling.
As with any data analytics technology, whether an SQL query engine is the right fit depends on the size and scope of the data that needs to be visualized. Some organizations have optimized their database management systems to the point that an SQL query engine is unnecessary for pulling data from multiple sources. Other organizations have overwhelmed their data warehouses and are seeking new ways to simplify BI and facilitate cross-data-source analytics.
This is where the scalability of SQL query engines makes them a viable alternative to traditional DBMSs. However, it is important to note that while database technologies have been around for decades, SQL query engines are a relatively new technology. If a query engine is not implemented correctly, it can disrupt workloads and deliver smaller cost savings than expected.
While query engines are a great way to visualize analytics for large sets of data, new users should consult an analytics professional who can help them better understand their use cases and business requirements before implementing a big data analytics solution.
If you are interested in learning more about an analytics solution that does not require a complex technology ecosystem, Trianz would be happy to demonstrate various querying and reporting solutions that can improve upon your current development time, accuracy, and performance.