The storage of digital information is commonplace in the modern world, with businesses relying heavily on customer data to fuel growth and inform decisions. The terms big data and data warehousing are thrown around a lot, often without a clear understanding of what they mean.
But what do these terms mean, and are they important to your business?
The term big data is relatively self-explanatory. It is used to describe the vast quantities of data that are generated on a daily basis by billions of people across the web.
The term also alludes to five characteristics of that data, which many people refer to as the five V’s: Velocity, Volume, Value, Variety, and Veracity.
Overall, big data describes a mixture of semi-structured and unstructured data treated as a collective entity. This information is processed to extract the useful attributes and discard the excess, ready for more specialized use. In effect, petabytes of mixed-quality information become terabytes of specifically useful information for storage in a data warehouse environment.
With Cisco predicting that almost 5 zettabytes of data would be transferred over the web in 2022, you can see how important it is to understand the concept of big data. Trianz can help you adapt your business through its expertise in big data consulting.
Where big data is a form of technology, data warehousing is a form of architecture for data storage.
Data warehousing is predominantly used to handle structured data, specifically relational data. Only information compatible with a database management system (DBMS) can be used here, unlike big data, which can handle a much broader range of data types. In stark contrast to the cluttered mix of information within a big data environment, a data warehouse will typically contain information prepared for use by business intelligence and database software.
There is much confusion around data warehousing, but there are two main interpretations to know about: the Inmon approach and the Kimball approach.
The Inmon approach to creating a data warehouse relies heavily on your existing corporate data model.
Your business operates within a specific niche of the market, a market segment containing customer, product and vendor categories. Each of these categories has its own separate model in which related metadata is stored. This metadata may include:
In contrast, in a ‘one-to-one’ relationship, one purchase order can correlate with only one customer.
According to the Inmon approach, a data warehouse is a centralized repository used across the whole business. With Inmon, the warehouse is implemented in a normalized manner, which reduces the complexity of loading data but requires tables and joins to be set up carefully so that queries work well. Because of this, most Inmon implementations use data marts. These are department-specific data repositories that divide up the database according to who is using it, rather than granting everyone access to all the stored information in the database.
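As a rough illustration of a normalized warehouse with a department-specific data mart, the sketch below uses Python's built-in `sqlite3` module. The table, column, and view names (`customer`, `product`, `purchase_order`, `sales_mart`) and the sample rows are hypothetical, and modelling the data mart as a simple view is an assumption made for brevity, not something the Inmon approach prescribes.

```python
import sqlite3


def build_normalized_warehouse():
    """Create a tiny normalized warehouse plus one data-mart view (all names hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Normalized core: each entity lives in its own table.
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
        CREATE TABLE purchase_order (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(customer_id),
            product_id INTEGER REFERENCES product(product_id),
            quantity INTEGER
        );

        INSERT INTO customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
        INSERT INTO product VALUES (10, 'Widget', 2.5), (11, 'Gadget', 10.0);
        INSERT INTO purchase_order VALUES (100, 1, 10, 4), (101, 2, 11, 1), (102, 1, 11, 2);

        -- Data mart for a sales team: the joins are written once, here,
        -- and the team only sees the columns it needs.
        CREATE VIEW sales_mart AS
        SELECT c.region AS region, SUM(o.quantity * p.unit_price) AS revenue
        FROM purchase_order AS o
        JOIN customer AS c ON c.customer_id = o.customer_id
        JOIN product AS p ON p.product_id = o.product_id
        GROUP BY c.region;
        """
    )
    return conn


def sales_by_region(conn):
    """Query the data mart instead of joining the normalized tables by hand."""
    rows = conn.execute("SELECT region, revenue FROM sales_mart ORDER BY region")
    return dict(rows.fetchall())
```

Calling `sales_by_region(build_normalized_warehouse())` returns revenue per region without the caller needing to know how the underlying tables are joined.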
The Kimball approach works from the ground up, identifying the vital questions that a database needs to answer, before building the database around those requirements.
Kimball implementations start by analyzing the operational systems on which your data relies. Then, ‘Extract, Transform, Load’ (ETL) software pulls data from these systems into a staging area before loading it into an accessible dimensional model.
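The extract, staging, and load steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the source rows, the `fact_orders` table, and the cleaning rules are all hypothetical, and real ETL software does far more (scheduling, logging, incremental loads).

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational system.
SOURCE_ROWS = [
    {"order_id": "100", "customer": " acme ", "amount": "10.00"},
    {"order_id": "101", "customer": "Globex", "amount": "n/a"},  # invalid amount
    {"order_id": "102", "customer": "ACME", "amount": "20.50"},
]


def extract(rows):
    """Extract: copy raw rows into a staging area (here, just a list)."""
    return [dict(row) for row in rows]


def transform(staged):
    """Transform: tidy names, cast types, and drop rows that fail validation."""
    clean = []
    for row in staged:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, source=SOURCE_ROWS):
    """Run the three steps in order and return total amount per customer."""
    load(conn, transform(extract(source)))
    query = "SELECT customer, SUM(amount) FROM fact_orders GROUP BY customer ORDER BY customer"
    return conn.execute(query).fetchall()
```

Keeping the three steps as separate functions mirrors the staging idea: bad rows are rejected during the transform step, before anything reaches the model that reports query.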
Kimball uses something called the “star schema”. A schema simply refers to a group of related tables within a database, which can be either “operational” or “reporting”. The term “star schema” relates to how this group (when formatted as a diagram) resembles a star shape. The central point of a star schema contains a fact table, consisting of all measures relating to a subject area, along with foreign keys from the surrounding dimensions.
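A star schema like the one described above might look as follows when sketched with Python's built-in `sqlite3` module. The dimension and fact table names (`dim_date`, `dim_product`, `dim_store`, `fact_sales`) and the sample data are hypothetical; the point is only that the fact table holds the measures plus one foreign key per surrounding dimension.

```python
import sqlite3

STAR_SCHEMA = """
-- Dimensions: the points of the star.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table: the centre of the star, holding measures and foreign keys.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key INTEGER REFERENCES dim_store(store_key),
    units_sold INTEGER,
    revenue REAL
);
"""


def build_star():
    """Create the schema in memory and load a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAR_SCHEMA)
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20220101, 2022, 1), (20220201, 2022, 2)],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Manual", "Books")],
    )
    conn.execute("INSERT INTO dim_store VALUES (1, 'Perth')")
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
        [(20220101, 1, 1, 3, 30.0), (20220201, 1, 1, 1, 10.0), (20220201, 2, 1, 2, 8.0)],
    )
    return conn


def revenue_by_category(conn):
    """A typical reporting query: one join from the fact table to one dimension."""
    query = """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """
    return conn.execute(query).fetchall()
```

Every reporting question is answered by joining the fact table to whichever dimensions the question mentions, which is what gives the diagram its star shape.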
Unlike in Inmon, a dimension (which is an individualized, non-overlapping specific dataset) is denormalized, allowing you to drill up and down between related datasets. As an example, a car can:
Each of the above would have its own dimension, allowing you to narrow down your search parameters and get specific information on the subject using data drilling, without the need to connect to another table. That, in simple terms, is how Kimball works.
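Drilling up and down can be sketched as changing how many levels of a denormalized dimension a query groups by. In the example below the car attributes (`make`, `model`, `colour`), the table names, and the sales figures are all hypothetical, chosen only to illustrate the idea. Because the whole hierarchy sits in one dimension table, going a level deeper needs no extra join beyond the one from the fact table to that dimension.

```python
import sqlite3

# Hypothetical hierarchy levels, stored as columns of one denormalized dimension.
LEVELS = ("make", "model", "colour")


def build_car_star():
    """Create a one-dimension star with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_car (car_key INTEGER PRIMARY KEY, make TEXT, model TEXT, colour TEXT);
        CREATE TABLE fact_car_sales (car_key INTEGER REFERENCES dim_car(car_key), units INTEGER);
        INSERT INTO dim_car VALUES
            (1, 'Ford', 'Fiesta', 'Red'),
            (2, 'Ford', 'Fiesta', 'Blue'),
            (3, 'Ford', 'Focus', 'Red');
        INSERT INTO fact_car_sales VALUES (1, 5), (2, 3), (3, 2);
        """
    )
    return conn


def units_by(conn, depth):
    """Total units grouped by the first `depth` levels of the hierarchy."""
    if not 1 <= depth <= len(LEVELS):
        raise ValueError(f"depth must be between 1 and {len(LEVELS)}")
    # Column names come from the LEVELS constant, never from user input.
    cols = ", ".join(LEVELS[:depth])
    query = (
        f"SELECT {cols}, SUM(f.units) "
        "FROM fact_car_sales AS f JOIN dim_car AS d ON d.car_key = f.car_key "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(query).fetchall()
```

Here `units_by(conn, 1)` gives totals per make, and `units_by(conn, 2)` drills down to make and model using the same two tables.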
To recap, multiple star schemas could be created for different reporting requirements across departments, all with their own dimensions and fact tables. Specific dimensions, such as customer and product information, could be made globally available to all fact tables across the different star schemas in the implementation, ensuring a “single source of truth” is referenced when making business decisions. Simply, everyone references the same central data points, reducing the risk of skewed results in reports.
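The idea of a globally available dimension can be sketched with two fact tables that reference the same customer dimension. The names used here (`dim_customer`, `fact_sales`, `fact_support`) and the two departments are hypothetical; the sketch only shows that both reports group by the same customer attributes, so their segment labels cannot drift apart.

```python
import sqlite3


def build_shared_dimension():
    """Two star schemas (sales and support) sharing one customer dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Shared dimension: the single source of truth for customer attributes.
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);

        -- One fact table per reporting requirement, both keyed to dim_customer.
        CREATE TABLE fact_sales (
            customer_key INTEGER REFERENCES dim_customer(customer_key), revenue REAL);
        CREATE TABLE fact_support (
            customer_key INTEGER REFERENCES dim_customer(customer_key), tickets INTEGER);

        INSERT INTO dim_customer VALUES (1, 'Acme', 'Enterprise'), (2, 'Globex', 'SMB');
        INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 20.0);
        INSERT INTO fact_support VALUES (1, 2), (2, 5);
        """
    )
    return conn


def segment_report(conn):
    """Combine both departments' measures, grouped by the shared segment column."""
    sales = dict(conn.execute(
        "SELECT d.segment, SUM(f.revenue) FROM fact_sales AS f "
        "JOIN dim_customer AS d ON d.customer_key = f.customer_key GROUP BY d.segment"
    ))
    support = dict(conn.execute(
        "SELECT d.segment, SUM(f.tickets) FROM fact_support AS f "
        "JOIN dim_customer AS d ON d.customer_key = f.customer_key GROUP BY d.segment"
    ))
    segments = sorted(set(sales) | set(support))
    return {seg: (sales.get(seg, 0.0), support.get(seg, 0)) for seg in segments}
```

If a customer's segment changes, it is updated once in `dim_customer` and both reports pick up the change.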
At first glance, these two terms seem to describe vastly different methods of processing data. In reality, data warehousing is just a more organized and acutely specialized version of big data: better aligned with querying and reporting than mass storage.
If you are interested in either big data consulting or data warehousing consulting, Trianz has decades of experience working with businesses to identify their technology needs. Get in contact using the form below for a consultation.