This article is the first in a planned series on what it’s like to have Snowflake as your data warehouse/data lake. In teaching “Zero 2 Snowflake” workshops, I have found that people have difficulty wrapping their heads around what it means to have a groundbreaking database that is truly designed for the cloud.
It’s easy to get involved in the technical details of the architecture and miss the proverbial forest for the trees. This post describes how Snowflake’s architecture spells an end to running ETL at night and on weekends, and how it drastically reduces the time ETL processes take to run.
The Snowflake features that help achieve all these are:
Immutable data storage
Pointers to the data
On-demand compute clusters that you pay for only while they run
Separation of storage from compute
Performance that scales near-linearly with cluster size
Why have companies been running ETL processes at night and on the weekend for all these years? Because the ETL process ran on the same computers that ran the reports, and in traditional databases writes and reads contend with each other. Nobody wants the ETL process to bog down the system.
Snowflake stores data immutably: there are no in-place changes, only writes. An “update” writes new data and then repoints metadata at it. This also enables another great Snowflake feature, Time Travel, which won’t be addressed in this blog. From an ETL perspective, the lack of contention between reads and writes allows ETL to be run at any time.
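A toy model of the idea, not Snowflake’s actual storage engine: data, once written, is never mutated; an “update” appends a new version and swaps a pointer, so old versions remain readable.

```python
# Toy copy-on-write store: illustrates immutable data plus pointer swaps.
# This models the concept only; it is NOT Snowflake's storage engine.

class ImmutableStore:
    def __init__(self):
        self._versions = []   # append-only list; entries are never changed
        self._pointers = {}   # key -> index of its current version

    def write(self, key, value):
        self._versions.append(value)                    # old data untouched
        self._pointers[key] = len(self._versions) - 1   # repoint, don't rewrite

    def read(self, key):
        return self._versions[self._pointers[key]]

    def read_version(self, index):
        """Old versions stay intact -- the basis for 'time travel'."""
        return self._versions[index]

store = ImmutableStore()
store.write("row1", "v1")
store.write("row1", "v2")      # an "update" is a new write + pointer swap
print(store.read("row1"))      # prints "v2"
print(store.read_version(0))   # prints "v1" -- history survives
```

Because readers follow pointers to already-written, never-changing data, a long-running load never blocks them.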
The ability to spin up or resume a compute cluster (Snowflake calls them virtual warehouses) at any time, combined with the fact that compute scales independently of storage, means that “regular business use” of the data is handled by different compute clusters than ETL. When it’s time to run ETL, all you have to do is spin up a cluster just for the ETL job. This takes only seconds, and Snowflake doesn’t charge you while a compute cluster is suspended.
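In Snowflake DDL, a dedicated ETL warehouse looks something like the following sketch; the warehouse name and size here are illustrative choices, not prescriptions.

```sql
-- Create a dedicated ETL warehouse that suspends itself when idle,
-- so you pay nothing between loads. Name and size are illustrative.
CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 60            -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE           -- wake automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;  -- no charges until the first query

-- Point the ETL session at it; reporting users keep their own warehouse.
USE WAREHOUSE etl_wh;
```

With AUTO_SUSPEND and AUTO_RESUME set, the warehouse bills only while queries actually run, which is what makes the pay-as-you-use model work in practice.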
If an ETL process takes ten hours on a single-node “cluster,” it should take about five hours on a 2-node cluster, two-and-a-half hours on a 4-node cluster, and so on. The hourly cost of the cluster doubles each time as well, but you only pay for the cluster while you are using it. So running a ten-hour process on a single node costs roughly the same as running that process in about five minutes on a 128-node cluster. During development, you can test to find the “point of no further advantage,” the size beyond which a bigger cluster no longer helps. In short, you can scale to enormous capacity without paying 24x365 for it.
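The back-of-the-envelope math above can be sketched as a tiny cost model; the near-linear scaling is an assumption (real jobs eventually hit the “point of no further advantage”), and cost is counted in abstract node-hours rather than Snowflake credits.

```python
# Back-of-the-envelope model: runtime halves as nodes double, so total
# node-hours (and thus cost) stays constant. Assumes ideal linear scaling.

def runtime_hours(base_hours: float, nodes: int) -> float:
    """Runtime under assumed near-linear scaling from a 1-node baseline."""
    return base_hours / nodes

def cost_node_hours(base_hours: float, nodes: int) -> float:
    """You pay per node only while the cluster runs."""
    return nodes * runtime_hours(base_hours, nodes)

base = 10.0  # a ten-hour job on a single-node "cluster"
for n in (1, 2, 4, 128):
    print(f"{n:>3} nodes: {runtime_hours(base, n):7.3f} h, "
          f"cost {cost_node_hours(base, n):.1f} node-hours")
# Ten hours on 1 node, five on 2, two-and-a-half on 4, and roughly
# 4.7 minutes on 128 nodes -- every run totals the same 10 node-hours.
```

Under this idealized model the bill is identical at every size; only the wall-clock time changes.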
Who in ETL hasn’t come to work in the morning only to find that a six-hour ETL process failed during the night? Fixing the bug and then rerunning the load without impacting the business might be impossible; there isn’t time to rerun before the users come in expecting to run their reports. With Snowflake, you can shorten the time a rerun takes by scaling up, and rerun the process without impacting the users running reports.
This radical ability to scale performance and pay only for what you use has made the Snowflake data warehouse “the” place to do all the processing that can be done. In essence, ETL has become ELT: Extract, Load, and then Transform.
You can move your data into Snowflake, then transform it with the massive scale and cost advantages of the Snowflake cloud database. But that’s a story for another blog. For now, living with Snowflake means sparing the ETL developer from working so many nights and weekends.