AWS Glue DataBrew is a visual data preparation tool that helps enterprises clean, normalize, and structure datasets up to 80% faster than traditional data preparation approaches. It can connect to Amazon S3 buckets, AWS data lakes, Aurora PostgreSQL, Amazon Redshift tables, Snowflake, and many other data sources.
Because AWS Glue DataBrew is a no-code ETL tool, it gives data analysts and data scientists self-service access to data preparation without writing a single line of code. As a result, a workflow typically reserved for engineers and ETL developers becomes accessible and usable by anyone, even without SQL or programming knowledge.
Users can choose from more than 250 built-in transformations to clean, combine, pivot, and transpose terabytes of data directly from their data warehouse, data lake, or database platform through a simple interactive interface. The output is ready for immediate use in analytics and machine learning.
Learn more: AWS Glue Implementation Services
Since enterprises host vast quantities of data in a variety of formats, it is often a time-consuming and resource-intensive process to transform data into actionable insights. With AWS Glue DataBrew, users are provided with a visual interface to make sense of the following data formats:
Raw, Unstructured Data
This can include system logs, sensor data, text files, and so forth. Raw data is generated as-is, without further processing. It has no model or structure. Raw data is suited to platforms like AWS data lakes, which can house and load large unstructured data sets.
Semi-Structured Data
An application or service may output data with a basic model or structure. This includes comma-separated values (CSV) files, which use commas and line breaks to form the columns and rows of a hierarchy and can be opened in tools like Excel.
Structured Data
Structured data could include a combined list of first and last names, dates, zip codes, or payment card information, organized under a fixed schema. Enterprises can query structured data using SQL or another query language.
Since AWS Glue DataBrew is used to clean, normalize, and structure datasets, both raw unstructured data and semi-structured data can be loaded and transformed into structured data. This allows analysts to quickly shape the data for their analytics solutions.
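To illustrate, raw or semi-structured records can be flattened into a tabular shape before analysis. The sketch below (plain Python, with illustrative field names) mimics the kind of normalization DataBrew performs when it structures nested data; it is not DataBrew's actual implementation:

```python
def flatten(record, parent_key="", sep="."):
    """Recursively flatten a nested dict into dotted column names."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, building dotted column names.
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

# A semi-structured record, such as an application might emit as JSON.
raw = {"name": {"first": "Ada", "last": "Lovelace"}, "zip": "10001"}
print(flatten(raw))
# {'name.first': 'Ada', 'name.last': 'Lovelace', 'zip': '10001'}
```

Each flattened record then maps cleanly onto the rows and columns of a structured table.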
As you can see, DataBrew has the potential to help enterprises prepare and transform large datasets into structured formats for analytical insight. Let’s explore the top four features of this visual data preparation tool:
Visualized Data Preparation
Where most data is viewed as alphanumeric values in columnar databases, DataBrew works differently. It visualizes all loaded data sources so you can understand the hierarchy and relationships within the data. True to its no-code design, the graphical user interface (GUI) lets users click, drag, and edit data without a single line of code.
250+ Data Preparation Automations
Data scientists follow several isolated, repeatable workflows and processes as part of their role. AWS has modeled these processes and workflows as language- and data-agnostic modules, creating a library of actions usable by end users. These include creating pivot tables, handling missing data values, filtering data, and encoding data. You can then save a sequence of actions as a DataBrew recipe, a reusable template for applying single or batched transformations automatically.
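Conceptually, a recipe is just an ordered list of named transformation steps applied in sequence. The sketch below uses plain Python with made-up step names (`fill_missing`, `lower_case`) to mirror the idea; it is not DataBrew's actual API:

```python
# Two reusable transformation steps, analogous to DataBrew's built-in actions.
def fill_missing(rows, column, default):
    """Replace missing values in a column with a default."""
    return [{**r, column: r.get(column) or default} for r in rows]

def lower_case(rows, column):
    """Lower-case every value in a column."""
    return [{**r, column: r[column].lower()} for r in rows]

# A "recipe": an ordered list of (step, parameters) pairs.
recipe = [
    (fill_missing, {"column": "city", "default": "unknown"}),
    (lower_case, {"column": "city"}),
]

def apply_recipe(rows, recipe):
    for step, params in recipe:
        rows = step(rows, **params)
    return rows

data = [{"city": "Boston"}, {"city": None}]
print(apply_recipe(data, recipe))
# [{'city': 'boston'}, {'city': 'unknown'}]
```

Because the recipe is data-agnostic, the same step list can be replayed against any dataset with a matching column.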
Data Lineage
Much like audit logs track user activity in an IT network, data lineage lets you track data transformation activity within AWS Glue DataBrew. This includes the data source, any transformations that have been applied, and the data output, including the target location. For transformations, it records how the data was transformed, what changed, and why, so that everyone can interpret the chain of events. Much like a family tree, this allows you to go back and understand how the data came to be.
Schema Matching
Trying to find the similarities between two databases? DataBrew helps you find matching fields that are present in two data sources. As matching fields are identified, they are loaded into a schema. This enables the creation of a homogeneous, centralized database that condenses multiple data sources into one.
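The matching logic can be pictured as a set intersection over field names and types. A minimal sketch with hypothetical source schemas (the field names and type labels are illustrative):

```python
# Two hypothetical data sources described as {field: type} schemas.
orders = {"customer_id": "int", "order_date": "date", "total": "float"}
crm    = {"customer_id": "int", "email": "str", "order_date": "date"}

# Fields present in both sources, with matching types, form the shared
# schema that a consolidated database could be built around.
shared_schema = {
    field: ftype
    for field, ftype in orders.items()
    if crm.get(field) == ftype
}
print(shared_schema)
# {'customer_id': 'int', 'order_date': 'date'}
```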
AWS Glue DataBrew is focused on improving the accessibility and usability of data preparation tasks. Here are a few benefits and advantages of using DataBrew in more detail:
Reduced Barrier of Entry to Data Preparation
The no-code format of the platform means anyone, of any skillset, can prepare data. While functionality is not advanced enough to replace data engineers, it allows non-technical employees to perform an ETL process, data exploration, and data quality workflows much like data analysts would. This reduces the barrier of entry to data preparation, which directly tackles the shortage of data science personnel across all industries.
Automated Data Profile Creation
As you work on projects in DataBrew, you will be shown a dataset preview, profile overview, column or row statistics, and a full data lineage. This falls under a data profile, giving you a snapshot overview of the new dataset.
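A data profile boils down to per-column summary statistics plus missing-value counts. A minimal sketch of such a snapshot, using only Python's standard library (the column values are made up):

```python
import statistics

# A single numeric column with one missing value.
values = [12.0, 15.5, None, 9.0, 14.5]
present = [v for v in values if v is not None]

# A tiny column profile: row count, missing count, and basic statistics.
profile = {
    "count": len(values),
    "missing": len(values) - len(present),
    "mean": statistics.mean(present),
    "min": min(present),
    "max": max(present),
}
print(profile)
# {'count': 5, 'missing': 1, 'mean': 12.75, 'min': 9.0, 'max': 15.5}
```

DataBrew generates this kind of summary for every column automatically, so you can spot gaps and outliers before building a recipe.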
Automate 250+ Common Data Preparation Processes
Using models for data preparation can significantly reduce the amount of effort required. Rather than writing a line of code, or copying a code template and editing the parameters, you can simply configure and execute an action to prepare data. This is known as workflow enhancement, speeding up low-level repeatable processes to get more done in less time.
Intelligent Prescriptive Suggestions
Glue DataBrew incorporates light prescriptive analytics as you work, offering suggestions that speed up preprocessing workflows, reduce utilization costs, and shorten time to benefit.
While AWS Glue DataBrew transforms the accessibility and usability of data for business users, it also has the following limitations:
No Code Means No Code
DataBrew uses standardized, tested modules that write the code for you while allowing you to edit parameters through a GUI. When AWS says “no code,” it really does mean no code: you cannot introduce custom code, as the platform is locked down to its 250+ data transformation actions. For advanced technical users, this may be a deal breaker, but most non-technical users will appreciate the no-code nature of DataBrew.
Cloud Only Leads to No Offline Support
Since AWS is a cloud service provider, there is naturally no offline support; DataBrew is accessible only through a web application. This means on-premises data centers cannot use their LAN datasets in DataBrew if their broadband connection fails, a complication for hybrid cloud users.
No Custom Visualizations
The no-code nature of the platform means visual interface customizations of any sort are scarce. You are limited to DataBrew's visualizations, such as basic graphs, charts, and bar plots, rather than integrating third-party services like Power BI, Tableau, or Looker.
DataBrew offers first-time users 40 completely free interactive sessions. This provides ample leeway to test and confirm that the platform suits your business requirements.
Per Session Pricing
DataBrew sessions are charged on a per-session basis. Each session costs $1.
AWS DataBrew sessions are 30 minutes each. A session is interactive, so any activity, even moving the mouse, keeps the clock ticking. If there is no activity by the end of a 30-minute period, the session ends.
You log on at 2 PM and work until 2:29 PM = $1 in costs.
You log on at 2 PM and work until 2:29 PM. Then, you log back in at 2:35 PM and forget to log off. DataBrew forces you off at 3 PM = $2 in costs.
You log in for 5 minutes between 2 and 2:30 PM, 5 minutes between 2:30 and 3 PM, and 5 minutes between 3 and 3:30 PM = $3 in costs.
As you can see, pricing is based on 30-minute blocks. Interacting with the service within these blocks will trigger a $1 charge, even if you only spend a minute logged in. You can reduce costs by maximizing work completed in each session as well as by logging off immediately once work is complete.
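Based on the examples above, the cost model can be expressed as counting the distinct 30-minute blocks in which any interaction occurred. A small sketch (the block length and $1 rate are taken from this article's description, not from an official API):

```python
def session_cost(interaction_minutes, block_len=30, rate=1.0):
    """Charge once per 30-minute block containing any interaction.

    interaction_minutes: minutes past a reference point (e.g. 2 PM)
    at which the user interacted with DataBrew.
    """
    blocks = {m // block_len for m in interaction_minutes}
    return len(blocks) * rate

# Work 2:00-2:29 PM continuously: one block touched.
print(session_cost(range(0, 30)))   # 1.0
# Touch three consecutive blocks for 5 minutes each: three blocks billed.
print(session_cost([5, 35, 65]))    # 3.0
```

This makes the cost-saving advice concrete: batching work into fewer blocks reduces the number of sessions billed.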
To learn more about Interactive Session pricing, you can access the AWS calculator for DataBrew.
Learn More: Amazon Redshift Consulting Services
Per Node Pricing
DataBrew jobs use nodes to calculate pricing. Each node includes four virtual CPUs and 16GB of RAM. The default for each job is five nodes.
A node hour costs $0.48, metered per minute.
Additional AWS Services
If you use supplementary AWS services alongside DataBrew jobs, such as the AWS Glue Data Catalog, you will be charged separately for each additional service.
You run a DataBrew data profile job for 60 minutes using 5 nodes. That is 5 node hours in total: 5 node hours x $0.48 per node hour = $2.40.
You run DataBrew jobs for 30 minutes using 10 nodes across two jobs: 10 nodes x 0.5 hours = 5 node hours, or $2.40 again.
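The node-hour arithmetic above can be captured in a small helper. The $0.48 rate and per-minute metering are as described in this section:

```python
# $0.48 per node hour, metered per minute (so 30 minutes = 0.5 hours).
NODE_HOUR_RATE = 0.48

def job_cost(nodes, minutes):
    """Cost of a DataBrew job: node hours consumed times the hourly rate."""
    node_hours = nodes * (minutes / 60)
    return round(node_hours * NODE_HOUR_RATE, 2)

print(job_cost(5, 60))   # 2.4  -- 5 nodes for 1 hour
print(job_cost(10, 30))  # 2.4  -- 10 nodes for 30 minutes
```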
Despite its cost-effective pricing, AWS Glue DataBrew might not be the best fit for your use cases. Here are a few DataBrew competitors and alternatives to consider:
Pandas
Pandas is an open-source tool for data analysis and manipulation. Built on the Python programming language, it allows users to create objects, view and select data, fill in missing data, merge datasets, and much more. Its “IO tools” support CSV, TXT, HTML, JSON, LaTeX, XML, SQL, SAS, and SPSS formats.
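For comparison, here is how two of the tasks mentioned, merging datasets and filling in missing data, look in Pandas (the sample data is made up):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Alan"]})
right = pd.DataFrame({"id": [1, 2], "total": [10.0, None]})

# Merge the two sources on a shared key, then fill the missing value.
merged = pd.merge(left, right, on="id")
merged["total"] = merged["total"].fillna(0.0)
print(merged["total"].tolist())  # [10.0, 0.0]
```

Unlike DataBrew, these steps are expressed in code, which is exactly the trade-off between the two tools.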
NumPy
Numerical Python (NumPy) is another tool used across all data industries. It is an open-source Python library focused on numerical data (Pandas is built on top of it). It allows users to create arrays, add and remove elements, alter array shapes and sizes, convert between 1D and 2D arrays, and create or reshape matrices.
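A brief example of the array reshaping NumPy is known for:

```python
import numpy as np

flat = np.arange(6)        # 1-D array: [0 1 2 3 4 5]
grid = flat.reshape(2, 3)  # reshape into a 2-D, 2x3 matrix

print(grid.shape)          # (2, 3)
print(grid.sum(axis=0))    # [3 5 7] -- column sums
```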
Alteryx
Where the previous two competitors are free and open source, Alteryx is a paid automated analytics service. It matches DataBrew's data quality and preparation tools, including data access, catalogs, exploration and profiling, and enrichment. More broadly, Alteryx also offers an analytics cloud, analytics process automation, and machine learning.
Dask
Dask is another free, open-source Python library, built to work in coordination with NumPy and Pandas. It allows users to create Dask arrays, bags, DataFrames, collections, and more. The focus is on enabling parallel, scalable Python solutions for data preparation using schedulers that execute task graphs.
As an AWS premier partner, Trianz can help you connect data with data engineers and ETL developers using AWS Glue to make it easy to discover, prepare, and combine data for analytics, machine learning, and application development. Whether you are looking to easily explore new data sets in a visual interface using DataBrew, or want to learn more about how AWS Glue can save you up to 50% on your legacy migrations — Trianz is here to help.
For more in-depth information on AWS Glue, check out our ETL migration page or schedule a call with one of our ETL migration experts.