Enterprises have, over time, invested in a variety of tools, technologies, and methodologies to solve the critical problem of managing enterprise data assets: data catalogs, security policies for data access, encryption and decryption of data (in motion and at rest), and identification of PII, PHI, and PCI data.
As technology has evolved, so have the tools and methodologies used to implement these capabilities. However, the problem persists, for a variety of reasons:
Enterprise engagement in digital transformation is changing the application landscape from on-prem to cloud (including SaaS)
As the application landscape changes, so does the data landscape, with MPPs, Hadoop ecosystems, and similar platforms giving way to cloud and cloud data warehouses
The integration layers, whether API or ETL/ELT, are also undergoing change, and so is the visualization layer
The emergence of AI/ML at the application, cloud, and data layers
Keeping up with constant change is the biggest challenge that Enterprises face, and it can seem like a never-ending cycle.
The challenge can be overcome if the right measures are implemented well. The following are a few observations:
Enterprises have implemented Data Governance tools and captured the business and technical metadata of the application ecosystem. Data Catalogs are created as part of the process (including rules associated with access, security, etc.). Whenever an application changes, the corresponding change needs to be reflected in the Data Governance tools, whether it is a change in metadata, security, or business rules.
Enterprises invest in data ecosystems, whether on-prem or cloud. The data sources for the ecosystems comprise internal applications and external data sets. The data ecosystems capture the metadata wherever available (more technical than business metadata). As a result, the data pipelines that pull data from applications into the data ecosystems have a data lineage that may or may not match the Enterprise Data Catalogs created in the Data Governance tools. A handshake (API) of the data ecosystem and Enterprise Data Catalog in the Data Governance tool is required so that the information is identical irrespective of which application a user seeks the information from.
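The handshake described above can be sketched as a reconciliation step between two metadata views. This is a minimal illustration, not any vendor's API: `ColumnMeta`, `reconcile`, and the sample data sets are hypothetical names, and a real integration would fetch both views through the catalog's and the data platform's own interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """One column as seen by either the catalog or the pipeline (hypothetical shape)."""
    dataset: str
    column: str
    data_type: str
    classification: str  # e.g. "PII", "PCI", "PUBLIC"


def reconcile(catalog, pipeline):
    """Return the differences between two metadata views, keyed by (dataset, column)."""
    cat = {(c.dataset, c.column): c for c in catalog}
    pipe = {(c.dataset, c.column): c for c in pipeline}
    return {
        "missing_in_catalog": sorted(pipe.keys() - cat.keys()),
        "missing_in_pipeline": sorted(cat.keys() - pipe.keys()),
        "mismatched": sorted(k for k in cat.keys() & pipe.keys() if cat[k] != pipe[k]),
    }


# Illustrative data only.
catalog_view = [
    ColumnMeta("customers", "email", "string", "PII"),
    ColumnMeta("customers", "id", "int", "PUBLIC"),
]
pipeline_view = [
    ColumnMeta("customers", "email", "string", "PUBLIC"),  # classification has drifted
    ColumnMeta("customers", "id", "int", "PUBLIC"),
    ColumnMeta("customers", "phone", "string", "PII"),     # not yet catalogued
]

report = reconcile(catalog_view, pipeline_view)
for kind, keys in report.items():
    print(kind, keys)
```

An empty report would indicate that both systems return identical information for the same data set, which is the outcome the handshake is meant to guarantee.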
Enterprises often invest in tools that handle Identity and Access Management (IAM) as well as Data Security policies, which are typically role/persona based. Tight integration between the IAM, Data Governance, and data ecosystem is critical, so that changes in policies are reflected across applications and implemented efficiently in an automated manner.
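One way to picture this automated propagation is a small publish/subscribe sketch in which a single policy change is pushed to every system that enforces it. The `PolicyHub` and `EnforcementPoint` classes, the role names, and the classification labels are all assumptions made for illustration; real IAM and governance tools expose their own integration mechanisms.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyHub:
    """Holds role-based policies and pushes every change to subscribed systems."""
    policies: dict = field(default_factory=dict)     # role -> allowed classifications
    subscribers: list = field(default_factory=list)  # callbacks taking (role, allowed)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def set_policy(self, role, allowed):
        self.policies[role] = frozenset(allowed)
        for notify in self.subscribers:
            notify(role, self.policies[role])


class EnforcementPoint:
    """A system (catalog, data platform, ...) that keeps a local copy of the policies."""

    def __init__(self, name):
        self.name = name
        self.local = {}

    def on_change(self, role, allowed):
        self.local[role] = allowed

    def can_read(self, role, classification):
        return classification in self.local.get(role, frozenset())


hub = PolicyHub()
governance_tool = EnforcementPoint("governance-catalog")
data_platform = EnforcementPoint("data-platform")
hub.subscribe(governance_tool.on_change)
hub.subscribe(data_platform.on_change)

hub.set_policy("analyst", {"PUBLIC"})
hub.set_policy("analyst", {"PUBLIC", "PII"})  # one change, applied everywhere

print(governance_tool.can_read("analyst", "PII"), data_platform.can_read("analyst", "PII"))
```

The design choice worth noting is that no system edits its own copy of a policy; every change flows from one place, so the enforcement points cannot silently diverge.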
Most tools have the capability to democratize data following a standard; however, if the standard is not followed effectively, there is a chance that data assets will get wrongly classified. An Enterprise must follow the laid-out standards, irrespective of the Component Business Model or any other method for classifying data assets. The following steps may help to ensure data classification follows a methodical process:
Business Functions (Level 0): the first step would be identifying the various Business Functions within the Enterprise (Sales and Marketing, Supply Chain, HR, Finance, Operations, etc.).
Sub Functions (Level 1): the second step would be identifying the Sub Functions associated with the Business Functions identified above. Note that not all organizations are alike, so a standard cookie-cutter approach won't work (at most, it will serve as a guide).
Business Processes (Level 2): the third step would be identifying the Business Processes that make up a Sub Function.
Activities (Level 3): the fourth step would be identifying the Activities that make up a Business Process.
Enterprise Applications, if one thinks it through, cater to a set of Activities that make up a Business Process, and a set of Business Processes together make up a Sub Function. A set of Sub Functions makes up a Business Function, and a set of Business Functions makes up the Enterprise.
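The four levels can be represented as a simple tree to which data sets are attached at the Activity level. The sketch below is illustrative only: the `Node`, `classify`, and `paths_for` names and the sample Finance path are assumptions, not part of any standard or tool.

```python
from dataclasses import dataclass, field

# Level 0 to Level 3, in the order described above.
LEVELS = ("Business Function", "Sub Function", "Business Process", "Activity")


@dataclass
class Node:
    name: str
    level: int                                      # index into LEVELS; -1 for the Enterprise root
    children: dict = field(default_factory=dict)    # child name -> Node
    data_sets: list = field(default_factory=list)   # data sets classified at this node


def classify(root, path, data_set):
    """Attach data_set to the Activity identified by a four-part path."""
    if len(path) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} levels, got {len(path)}")
    node = root
    for level, name in enumerate(path):
        node = node.children.setdefault(name, Node(name, level))
    node.data_sets.append(data_set)
    return node


def paths_for(root, data_set, prefix=()):
    """Return every path under root at which data_set has been classified."""
    found = []
    for child in root.children.values():
        here = prefix + (child.name,)
        if data_set in child.data_sets:
            found.append(here)
        found.extend(paths_for(child, data_set, here))
    return found


enterprise = Node("Enterprise", -1)
# Hypothetical example path; each organization defines its own levels.
classify(
    enterprise,
    ("Finance", "Accounts Payable", "Invoice Processing", "Invoice Matching"),
    "vendor_invoices",
)
print(paths_for(enterprise, "vendor_invoices"))
```

A data set that shows up under more than one path is a candidate for the duplication check that this kind of classification is meant to support.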
Given the above thought process, Enterprises should strive to arrange data sets (not data assets) within the data ecosystem in the above manner. Most organizations instead arrange data assets, which may not be the right approach, as it can lead to duplication of data assets within the Enterprise and associated issues.
AI/ML plays a critical role in Observability and Action: changes in the application landscape or data landscape must be observed and acted upon so that the data catalogs and lineage stay in step with the changes made.
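The observe-then-act loop can be illustrated with a deliberately simple, rule-based schema comparison. This is not an AI/ML model; it only shows where such a model would plug in. The `detect_drift` and `to_actions` functions and the sample schemas are hypothetical.

```python
def detect_drift(previous, current):
    """Observe: compare two {column: type} snapshots and list the changes."""
    changes = []
    for col in sorted(previous.keys() | current.keys()):
        if col not in previous:
            changes.append(("added", col, current[col]))
        elif col not in current:
            changes.append(("removed", col, previous[col]))
        elif previous[col] != current[col]:
            changes.append(("retyped", col, f"{previous[col]} -> {current[col]}"))
    return changes


def to_actions(dataset, changes):
    """Act: turn each observed change into a catalog update request (plain text here)."""
    return [
        f"update catalog entry {dataset}.{col}: {kind} ({detail})"
        for kind, col, detail in changes
    ]


# Illustrative snapshots of one application table before and after a release.
before = {"id": "int", "email": "string", "age": "int"}
after = {"id": "int", "email": "string", "age": "string", "phone": "string"}

changes = detect_drift(before, after)
for action in to_actions("customers", changes):
    print(action)
```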
However, one may argue that the investments already made by Enterprises in data ecosystems, like data platforms and tools associated with data cataloging, governance, quality, security, etc., should be considered while implementing the above thought process. Keeping in mind the investments made by an Enterprise, the following thoughts may help:
Shift the thinking from a "Push", "Ingest", or "Event Streams" model to a "Serving and Pull" model across Data Domains
Data will be "Contextualized" in different Domains as it transforms into a format applicable to a particular Domain
Visualize a scenario where a "Player Domain" owns and provides data sets for access to any team for any purpose downstream. The technical aspect of where the data sets reside and how they flow falls under the area of the "Player Domain"
The Source Domain data sets capture data closely aligned to what is generated by the operational systems they originate from, that is, the systems of reality
The Consumer Domain data sets and their respective teams aim to fulfill a set of closely related use cases
The Domain team should apply a Product mindset to its data sets, covering aspects like Discoverability, Addressability, Trustability, Self-Describing, Inter-Operability, and Security. The following points need to be considered:
Data Catalog of all available data products with their metadata information and lineage
Once discovered, a data product should have a unique address that follows a global convention, so that users can access it programmatically
Maintain high quality and certify data sets to make them trustworthy
Data products should be easily understandable and consumable
Harmonize the data so that it can be correlated across domains
Access control is applied at a finer granularity (both in motion and at rest) by the Domain owner
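The points above can be sketched as a small data product descriptor and catalog. Everything here is an assumption made for illustration: the `DataProduct` fields, the `dp://domain/name/vN` address convention, and the role names are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    domain: str
    name: str
    version: int
    description: str          # self-describing
    schema: dict              # column -> type, shared types aid interoperability
    quality_score: float      # trustability
    certified: bool
    allowed_roles: frozenset  # security, set by the Domain owner
    lineage: list = field(default_factory=list)

    @property
    def address(self):
        """Addressability: one global naming convention (hypothetical format)."""
        return f"dp://{self.domain}/{self.name}/v{self.version}"


class Catalog:
    """Discoverability: a registry of data products keyed by address."""

    def __init__(self):
        self._by_address = {}

    def register(self, product):
        if product.address in self._by_address:
            raise ValueError(f"duplicate address: {product.address}")
        self._by_address[product.address] = product

    def search(self, term):
        term = term.lower()
        return [
            p.address
            for p in self._by_address.values()
            if term in p.name.lower() or term in p.description.lower()
        ]

    def resolve(self, address, role):
        product = self._by_address[address]
        if role not in product.allowed_roles:
            raise PermissionError(f"{role} may not access {address}")
        return product


orders = DataProduct(
    domain="sales",
    name="orders",
    version=1,
    description="Confirmed customer orders, one row per order line",
    schema={"order_id": "string", "amount": "decimal"},
    quality_score=0.97,
    certified=True,
    allowed_roles=frozenset({"analyst"}),
    lineage=["crm.orders_raw"],
)

catalog = Catalog()
catalog.register(orders)
print(catalog.search("order"))
print(catalog.resolve(orders.address, "analyst").schema)
```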
The concept of Federated Computational Governance needs to be mentioned, as the goal is to achieve interoperability across all data products through standardization. The following key aspects need to be kept in mind:
Policies on interoperability allow other domain teams to use data products in a consistent way.
Interoperability policies ensure a consistent data format, discovery mechanism, and access control across domains.
Documentation ensures consistency in how individual domains are maintained and made accessible for consumption.
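The "computational" part of Federated Computational Governance means that such policies are expressed as code and checked automatically for every data product. A minimal sketch follows, assuming a hypothetical descriptor format; the required fields and allowed formats are illustrative choices, not a published standard.

```python
# Global policy agreed across domains (illustrative values).
REQUIRED_FIELDS = {"address", "format", "owner", "access_policy"}
ALLOWED_FORMATS = {"parquet", "iceberg"}


def check_product(descriptor):
    """Return the list of policy violations for one data product descriptor."""
    violations = [
        f"missing field: {name}"
        for name in sorted(REQUIRED_FIELDS - descriptor.keys())
    ]
    fmt = descriptor.get("format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        violations.append(f"unsupported format: {fmt}")
    return violations


def check_all(descriptors):
    """Apply the same global policy to every domain's products."""
    results = {d.get("address", "<no address>"): check_product(d) for d in descriptors}
    return {address: v for address, v in results.items() if v}


# Hypothetical descriptors published by two domains.
products = [
    {"address": "dp://sales/orders/v1", "format": "parquet",
     "owner": "sales-team", "access_policy": "role-based"},
    {"address": "dp://hr/headcount/v2", "format": "csv", "owner": "hr-team"},
]

failures = check_all(products)
for address, problems in failures.items():
    print(address, problems)
```

Each domain keeps ownership of its products, while the shared check keeps them interoperable; a product that fails would be flagged before it is published for consumption.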
The logical solution architecture for a Data Product focused data ecosystem (keeping in mind the investments already made) would be like below:
Now one would ask, how does the above get implemented in a real-world scenario? We at Trianz believe in creating IP Led Solutions, and our flagship solution is called "Extrica". Our value proposition is: "Enabling people to easily connect to data they trust to make better decisions". The core features of Extrica include:
Extrica is built on AWS using the native features of AWS and takes into consideration the investments made by an Enterprise. Each of the above features is critical from an Enterprise standpoint and propagates the concept of Data Products by accessing our industry-specific library of Data Products and pre-defined KPIs/Metrics. The ease with which Queries can be created within Extrica to create user-defined data products and catalogs is what makes it the platform of choice for users. Data Producers and Consumers can define Data Quality rules or apply the rules from the available library of rules to make the Data trustable and enhance (AI/ML) the Trustability Score displayed on the Data Product Card.
For more information, please connect with us at: [email protected]