The storage of digital information is commonplace in the modern world, with businesses relying heavily on customer data to fuel growth and influence decisions. The term big data is thrown around a lot, along with data warehousing, without many people knowing what either one means.
But what do these terms mean, and are they important to your business?
What is Big Data?
The term big data is relatively self-explanatory. It is used to describe the vast quantities of data that are generated on a daily basis by billions of people across the web.
The term also alludes to:
- The Variety of different data types, coming from various sources.
- Volume refers to the sheer quantity of data being generated and how much storage and processing power is needed to categorize it.
- Velocity is the measure of how quickly data is generated and processed, and is highly dependent on the quantities of data being inputted.
- Differing quality of data, also known as Veracity, refers to inconsistencies and vague correlations that will require further manual administration or disposal.
- The inherent Value the data offers, including what can be extracted for use by a business.
Many people refer to these as the five V’s, which are Velocity, Volume, Value, Variety, and Veracity.
Overall, Big Data describes a mixture of semi-structured and unstructured data treated as a collective entity. All of this information gets processed: the useful attributes are extracted and the excess is discarded, essentially turning petabytes of mixed-quality information into terabytes of specifically useful information, ready for storage in a data warehouse environment.
With Cisco predicting that almost 5 zettabytes of data will be transferred over the web in 2022, you can see how important understanding the concept of big data is. Trianz can help you to better adapt your business thanks to their expertise in Big Data consulting.
What is Data Warehousing?
Where big data is a form of technology, data warehousing is a form of architecture for data storage.
Data warehousing is predominantly used to handle structured data, specifically relational data. Only information compatible with a database management system (DBMS) can be used here, unlike big data, which can handle a much broader range of data types. In stark contrast to the cluttered mess of information within a big data environment, a data warehouse will typically contain information for use by business intelligence and database software.
Approaches to Data Warehousing
There is much confusion around data warehousing, but there are two main interpretations to know about:
The Inmon Approach
This approach to creating a data warehouse relies heavily on your existing corporate data model.
Your business operates within a specific niche of the market, a market segment containing customer, product and vendor categories. Each of these individual categories will have their own separate model in which specifically related metadata will be stored. This metadata may include:
- Attributes – such as customer names, dates of birth, genders, etc.
- Dependencies – or a constraint that dictates the relationship that attributes have with one another.
- Participation – this is a more complex metadata type, which relies on minimum cardinality. Low cardinality means that there are many repeated values, with high cardinality indicating many unique values. In a database, cardinality relationships take the form of ‘one-to-one’, ‘one-to-many’, and ‘many-to-many’ relationships.
- Relationships – a further example of cardinality, specifically a ‘one-to-many’ relationship: one customer can order multiple products.
In contrast, a ‘one-to-one’ relationship means that one purchase order can only correlate with one customer.
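The one-to-many relationship above can be sketched in a relational database. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names are hypothetical, not taken from any particular system.

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One customer...
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
# ...to many orders: each order row points back at exactly one customer.
cur.execute("""
    CREATE TABLE purchase_order (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        product TEXT
    )
""")

cur.execute("INSERT INTO customer VALUES (1, 'Alice')")
cur.executemany(
    "INSERT INTO purchase_order VALUES (?, ?, ?)",
    [(1, 1, 'Laptop'), (2, 1, 'Monitor')],
)

# One customer, two orders: a one-to-many relationship.
cur.execute("SELECT COUNT(*) FROM purchase_order WHERE customer_id = 1")
print(cur.fetchone()[0])  # 2
```

Reversing the foreign key (putting an `order_id` on the customer row instead) would force a one-to-one relationship; the direction of the reference is what encodes the cardinality.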
The definition of a data warehouse, according to the Inmon approach, describes a centralized repository used across the whole business. With Inmon, the warehouse is implemented in a normalized manner, which reduces the complexity of loading data but requires the setup of tables and joins to ensure query functions work well. Due to this, most Inmon implementations use something called data marts. These data marts are department-specific data repositories that divide up the database depending on who is using it, rather than granting everyone access to all the stored information in the database.
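One common way to carve a data mart out of a normalized warehouse is with a database view built from joins over the underlying tables. The sketch below uses Python's sqlite3 module; the tables, columns, and the "sales mart" itself are hypothetical examples, not part of any standard Inmon toolkit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized warehouse tables: each fact lives in exactly one place.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
cur.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""
    CREATE TABLE sale (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        product_id INTEGER REFERENCES product(id),
        amount REAL
    )
""")

cur.execute("INSERT INTO customer VALUES (1, 'Alice', 'EMEA')")
cur.execute("INSERT INTO product VALUES (1, 'Laptop')")
cur.execute("INSERT INTO sale VALUES (1, 1, 1, 999.0)")

# A departmental 'data mart': the sales team sees only what it needs,
# assembled via joins rather than by copying the whole database.
cur.execute("""
    CREATE VIEW sales_mart AS
    SELECT c.region, p.name AS product, s.amount
    FROM sale s
    JOIN customer c ON c.id = s.customer_id
    JOIN product p ON p.id = s.product_id
""")

cur.execute("SELECT region, product, amount FROM sales_mart")
print(cur.fetchall())  # [('EMEA', 'Laptop', 999.0)]
```

This is the trade-off the article describes: loading normalized tables is simple, but useful queries need the joins, so the marts pre-package those joins per department.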
The Kimball Approach
The Kimball approach works from the ground up, identifying the vital questions that a database needs to answer, before building the database around those requirements.
Kimball implementations will start by analyzing the operational systems on which your data relies. Then, something called ‘Extract, Transform, Load’ (ETL) software will pull data from these systems into a staging area before loading it into an accessible dimensional model.
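The three ETL stages can be sketched in a few lines of plain Python. This is a toy illustration, assuming hard-coded source records with made-up field names; real ETL tools operate on live operational systems at far greater scale.

```python
# A minimal Extract-Transform-Load sketch; the records and field
# names below are hypothetical.

def extract():
    # Extract: pull raw rows from an operational system
    # (hard-coded here in place of a real source).
    return [
        {"customer": "  Alice ", "amount": "19.99"},
        {"customer": "Bob", "amount": "5.00"},
    ]

def transform(rows):
    # Transform: clean up strings and cast types in the staging step.
    return [
        {"customer": r["customer"].strip(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    # Load: append the cleaned rows into the target dimensional model
    # (a plain list stands in for the warehouse here).
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'customer': 'Alice', 'amount': 19.99}
```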
Kimball uses something called the “star schema”. A schema simply refers to a group of related tables within a database, which can be either “operational” or “reporting”. The term “star schema” relates to how this group (when formatted as a diagram) resembles a star shape. The central point of a star schema contains a fact table, consisting of all measures relating to a subject area, along with foreign keys from the surrounding dimensions.
Unlike in Inmon’s approach, a dimension (which is an individualized, non-overlapping specific dataset) is denormalized, allowing you to drill up and down between related datasets. As an example, a car can:
- Be categorized by type – SUV, Sedan, Coupe
- Be categorized by manufacturer
- Be categorized by year of production
Each of the above would have its own dimension, allowing you to narrow down your search parameters and get specific information on the subject using data drilling, without the need to connect to another table. This, in essence, is how Kimball works.
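The car example above can be turned into a tiny star schema: a central fact table of measures, joined to one denormalized car dimension that holds type, manufacturer, and year together. A sketch using Python's sqlite3 module follows; all table names, manufacturers, and figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized car dimension: type, manufacturer, and year all live in
# one table, so drilling between them needs no extra joins.
cur.execute("""
    CREATE TABLE dim_car (
        car_key INTEGER PRIMARY KEY,
        type TEXT, manufacturer TEXT, year INTEGER
    )
""")
# Fact table at the center of the star: measures plus a foreign key
# into each surrounding dimension.
cur.execute("""
    CREATE TABLE fact_sales (
        car_key INTEGER REFERENCES dim_car(car_key),
        units_sold INTEGER
    )
""")

cur.executemany("INSERT INTO dim_car VALUES (?, ?, ?, ?)", [
    (1, 'SUV', 'Acme Motors', 2020),
    (2, 'Sedan', 'Acme Motors', 2021),
    (3, 'Coupe', 'Widget Cars', 2021),
])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)", [
    (1, 100), (2, 60), (3, 40),
])

# Drill down: total units by type, narrowed to one manufacturer,
# using only the single denormalized dimension.
cur.execute("""
    SELECT d.type, SUM(f.units_sold)
    FROM fact_sales f JOIN dim_car d ON d.car_key = f.car_key
    WHERE d.manufacturer = 'Acme Motors'
    GROUP BY d.type ORDER BY d.type
""")
print(cur.fetchall())
```

In a normalized design, type, manufacturer, and year might each sit in a separate table; here the one-join query is the payoff of denormalizing the dimension.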
To recap, multiple star schemas could be created for different reporting requirements across departments, all with their own dimensions and fact tables. Specific dimensions, such as customer and product information, could be made globally available to all fact tables across the different star schemas in the implementation, ensuring a “single source of truth” is referenced when making business decisions. Put simply, everyone references the same central data points, reducing the risk of skewed results in reports.
Are Big Data and Data Warehousing that different?
At first glance, these two terms seem to describe vastly different methods of processing data. In reality, data warehousing is simply a more organized and specialized relative of big data, better aligned with querying and reporting than with mass storage.
If you are interested in either big data consulting or data warehousing consulting, Trianz has decades of experience working with businesses to identify their technology needs. Get in contact using the form below for a consultation.