SQL query engine architecture was designed to allow users to query a variety of data sources within a single query. While early SQL-based query engines such as Apache Hive allowed analysts to cut through the clutter of analytical data, they found running SQL analytics on multi-petabyte data warehouses to be a time-intensive process that was difficult to visualize and hard to scale.
With the introduction of powerful interactive query engines like Athena Federated Query, users can quickly query multiple sources of data within a serverless, easy-to-use fully managed SQL interface. This process eliminates coding complexity by allowing users to perform cross-data source analytics using familiar SQL constructs for quick analysis, or by using scheduled SQL queries to extract and store results in Amazon S3 for subsequent analysis.
With a federated query, what is being “federated” or brought together are the sources of information that can answer the query. For instance, think of the advantage of being able to look in all the rooms in your house at once for the car keys that have gone missing? Now imagine being able to instantaneously look in every book in a library for an answer — no matter what language the book was written in.
This is akin to how Athena Federated Query connectors have improved database queries and the delivery of business intelligence insights. The connectors provide the ability to connect and query databases across multiple on-prem and public cloud environments at a much faster rate than traditional database management systems (DBMS).
What makes Amazon Athena Federated Query unique is that the tool allows non-technical users to scan multiple sources of data with simple SQL commands. Users can also begin analyzing and reporting on their structured and unstructured data within a single query using the built-in query editor.
In addition, with Athena’s Federated Query connectors, users no longer need to deal with setting up servers, frameworks, or clusters. Instead, users simply select the stored data in their data lake (Amazon S3), define the query, and the AFQ connectors automatically retrieve and aggregate data from specified repositories.
Athena Federated Query Connectors work by using data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between the target data source and Athena.
An SQL connector can be thought of as an extension of Athena's query engine. Prebuilt Athena data source connectors allow Amazon data sources like Amazon DynamoDB and JDBC-compliant relational data sources such as MySQL and PostgreSQL to work in unison.
Users can also employ the Athena Query Federation SDK to write custom connectors. This allows them to choose, configure, and deploy a wide range of data source connectors to their account.
After the user deploys the data source connectors, the connector is associated with a catalog that the user can specify in SQL queries. From there, they can combine SQL statements from several catalogs, spanning multiple data sources with a single query.
When the query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the specific tables that need to be read. Athena then manages data parallelism and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements to strengthen user-level security.
The Connectors use Apache Arrow as the format for returning data requested in a query. This enables connectors to be implemented in different programming languages and data formats, such as CSV, TSV, and JSON. It also supports open-source columnar formats and compressed data formats such as Zlib, LZO, and gzip.
Since connectors are processed in an event-driven, serverless computing platform (Lambda), they can be used to access data from any data source that is accessible from Lambda in the cloud or on-prem infrastructure.
The result of using Athena Query Federation is that users no longer need multiple languages to build complex pipelines to extract, transform, and load into a data warehouse before users can query the data.
Imagine an eCommerce store has noticed a spike in customers reporting their orders are not being processed. With key operational metrics and logs residing in multiple places, it would be next to impossible to quickly assess the scope of this problem — or even begin to do root cause analysis before having to refund the order.
By using Athena, the business can use federated queries to quickly find relevant details — even if the analyst is not experienced with each of the individual systems that make up their data architecture. Athena allows the user to run a query and learn how many customer orders are problematic. They can then observe commonalities between the orders to locate the root cause or any relevant information that helps them to investigate further.
In addition, the business can extract all the various warnings and errors from the specified connectors and AWS resources. By running a query search across multiple silos and data stacks, Athena can pull all the relevant criteria to determine what orders need to be processed.
After creating the tables in a matter of seconds, that information can then be easily viewed and used to dive deeper to troubleshoot the problem. What may have taken days without the use of a federated query now takes far less time and technical wrangling.
To learn more about how Athena Federated Query Connectors work under the hood, check out the following video:
As with any data analytics technology, the answer to the SQL query engine question depends on the size and scope of data that needs to be visualized. Some organizations have optimized their database management systems so that SQL query engines are not necessary to pull data across multiple sources. Other organizations have overwhelmed their data warehouses and are seeking new ways to simplify BI and facilitate cross data-source analytics.
This is where the scalability of SQL query engines become a viable alternative to traditional DBMSs. However, it is important to note that while database technologies have been around for decades, SQL query engines are a relatively new technology. If the query engine is not implemented correctly, it may disrupt workloads and lead to suboptimal cost reduction.
While query engines are a great way to visualize analytics for large sets of data, new users should consult an analytics professional who can help them better understand their use cases and business requirements before implementing a big data analytics solution.
If you are interested in learning more about an analytics solution that does not require a complex technology ecosystem, Trianz would be happy to demonstrate various querying and reporting solutions that can improve upon your current development time, accuracy, and performance.
Making Data More Accessible For many years, data models have plagued data scientists and analysts with inefficiency that eroded the usefulness of their organizational data. While solutions such as data lakes and data warehouses create a central repository for organizational data, they often lack the agility to deliver the complex data insights required to power a modern enterprise.Explore
Breaking Down the Walls Every organization deals with data in one way or another—whether in a database, data warehouse, or other architecture type. With this data comes a management burden, as customer data must be protected in line with data regulations. IT teams struggle with data pipelines: controlling access to datasets across numerous products, services, and business applications. Improper data governance and security configurations can prevent data access entirely and leave data in the wrong internal or external hands.Explore
Putting Data to Work Recently, one of the world’s largest global shipping companies was seeking to identify new revenue opportunities; specifically, they were interested in monetizing their data by building other, related business intelligence products for different industries. Like many other businesses, they had found themselves sitting on a mountain of actionable data without any processes in place to explore or leverage said data. Their intentions were now pointed in the right direction, but what they were missing was a data monetization strategy.Explore
The Data Tide Businesses in the digital age are inundated with data as it floods in from multiple channels. This data is both a challenge to wade through and an absolute goldmine. Its tremendous potential can be harnessed to communicate meaningfully with audiences and advance an organization’s brand awareness in the public eye. The problem is, however, that raw data itself can’t tell a compelling story to most people. It needs to be woven together artfully to create a narrative that connects with a specific audience. This is where data-driven storytelling comes in.Explore