SQL query engine architecture was designed to allow users to query a variety of data sources within a single query. While early SQL-based query engines such as Apache Hive allowed analysts to cut through the clutter of analytical data, they found running SQL analytics on multi-petabyte data warehouses to be a time-intensive process that was difficult to visualize and hard to scale.
With the introduction of powerful interactive query engines like Athena Federated Query, users can quickly query multiple sources of data within a serverless, easy-to-use fully managed SQL interface. This process eliminates coding complexity by allowing users to perform cross-data source analytics using familiar SQL constructs for quick analysis, or by using scheduled SQL queries to extract and store results in Amazon S3 for subsequent analysis.
With a federated query, what is being “federated” or brought together are the sources of information that can answer the query. For instance, think of the advantage of being able to look in all the rooms in your house at once for the car keys that have gone missing? Now imagine being able to instantaneously look in every book in a library for an answer — no matter what language the book was written in.
This is akin to how Athena Federated Query connectors have improved database queries and the delivery of business intelligence insights. The connectors provide the ability to connect and query databases across multiple on-prem and public cloud environments at a much faster rate than traditional database management systems (DBMS).
What makes Amazon Athena Federated Query unique is that the tool allows non-technical users to scan multiple sources of data with simple SQL commands. Users can also begin analyzing and reporting on their structured and unstructured data within a single query using the built-in query editor.
In addition, with Athena’s Federated Query connectors, users no longer need to deal with setting up servers, frameworks, or clusters. Instead, users simply select the stored data in their data lake (Amazon S3), define the query, and the AFQ connectors automatically retrieve and aggregate data from specified repositories.
Athena Federated Query Connectors work by using data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between the target data source and Athena.
An SQL connector can be thought of as an extension of Athena's query engine. Prebuilt Athena data source connectors allow Amazon data sources like Amazon DynamoDB and JDBC-compliant relational data sources such as MySQL and PostgreSQL to work in unison.
Users can also employ the Athena Query Federation SDK to write custom connectors. This allows them to choose, configure, and deploy a wide range of data source connectors to their account.
After the user deploys the data source connectors, the connector is associated with a catalog that the user can specify in SQL queries. From there, they can combine SQL statements from several catalogs, spanning multiple data sources with a single query.
When the query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the specific tables that need to be read. Athena then manages data parallelism and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements to strengthen user-level security.
The Connectors use Apache Arrow as the format for returning data requested in a query. This enables connectors to be implemented in different programming languages and data formats, such as CSV, TSV, and JSON. It also supports open-source columnar formats and compressed data formats such as Zlib, LZO, and gzip.
Since connectors are processed in an event-driven, serverless computing platform (Lambda), they can be used to access data from any data source that is accessible from Lambda in the cloud or on-prem infrastructure.
The result of using Athena Query Federation is that users no longer need multiple languages to build complex pipelines to extract, transform, and load into a data warehouse before users can query the data.
Imagine an eCommerce store has noticed a spike in customers reporting their orders are not being processed. With key operational metrics and logs residing in multiple places, it would be next to impossible to quickly assess the scope of this problem — or even begin to do root cause analysis before having to refund the order.
By using Athena, the business can use federated queries to quickly find relevant details — even if the analyst is not experienced with each of the individual systems that make up their data architecture. Athena allows the user to run a query and learn how many customer orders are problematic. They can then observe commonalities between the orders to locate the root cause or any relevant information that helps them to investigate further.
In addition, the business can extract all the various warnings and errors from the specified connectors and AWS resources. By running a query search across multiple silos and data stacks, Athena can pull all the relevant criteria to determine what orders need to be processed.
After creating the tables in a matter of seconds, that information can then be easily viewed and used to dive deeper to troubleshoot the problem. What may have taken days without the use of a federated query now takes far less time and technical wrangling.
To learn more about how Athena Federated Query Connectors work under the hood, check out the following video:
As with any data analytics technology, the answer to the SQL query engine question depends on the size and scope of data that needs to be visualized. Some organizations have optimized their database management systems so that SQL query engines are not necessary to pull data across multiple sources. Other organizations have overwhelmed their data warehouses and are seeking new ways to simplify BI and facilitate cross data-source analytics.
This is where the scalability of SQL query engines become a viable alternative to traditional DBMSs. However, it is important to note that while database technologies have been around for decades, SQL query engines are a relatively new technology. If the query engine is not implemented correctly, it may disrupt workloads and lead to suboptimal cost reduction.
While query engines are a great way to visualize analytics for large sets of data, new users should consult an analytics professional who can help them better understand their use cases and business requirements before implementing a big data analytics solution.
If you are interested in learning more about an analytics solution that does not require a complex technology ecosystem, Trianz would be happy to demonstrate various querying and reporting solutions that can improve upon your current development time, accuracy, and performance.
A Winning Base for Successful Digital Transformations When it comes to developing a successful digital strategy, it is not just corporations planning to maximize the benefits of data assets and technology-focused initiatives. The Government of Western Australia recently unveiled four key priorities for digital reform in its new Digital Strategy for 2021-2025.Explore
Engage Your Workforce with a Modern Employee Intranet Solution The employee intranet has changed significantly since it was first introduced in the early 1990s. What started as HTML-based static portals have now evolved into intuitive communication tools complete with search engines, user profiles, blogs, event planners, and more. Today, many organizations are taking a second look at employee intranets to bridge gaps between teams, build company culture, centralize information, increase productivity, and improve workflow.Explore
Adopting emerging cloud technologies, consolidating resources, and improving processes is the key. “IT no longer just supports corporate operations as it traditionally has but is fully participating in business value delivery. Not only does this shift IT from a back-office role to the front of business, but it also changes the source of funding from an overhead expense that is maintained, monitored, and sometimes cut, to the thing that drives revenue,” said John-David Lovelock, research vice president at Gartner.Explore
Deliver Powerful Insights Instantaneously with Federated Queries - No Matter Where Your Data Resides The concept of federated queries isn’t new. Facebook PrestoDB popularized the idea of distributed structured query language (SQL) query engines in 2013. Over the years, AWS, Google, Microsoft, and many others in the industry have accelerated the adoption of a distributed query engine model within their products. For example, AWS developed Amazon Athena on top of the Presto code base, while Google’s BigQuery is based on Cloud SQL.Explore
What is Unstructured Data? Almost 80% of the data that enterprises and organizations collect is unstructured - data without a set record format or structure. Unstructured data includes data such as emails, web pages, PDFs, documents, customer feedback, in-app reviews, social media, video files, audio files, and images.Explore
The Best of Both Worlds A data lakehouse is a new, open data management architecture designed to combine the analytic benefits of a data warehouse and a data lake. By leveraging the machine learning capabilities of a data lake combined with the support of a data warehouse’s BI insights, the lakehouse approach can address data staleness, reliability, scalability, data lock-in, and limited use-case support.Explore