Are They Brand New Paradigms or Just Reworked Data?
One of the most animated debates among Data Architects in recent months concerns the merits of different data architectures. The discussion has been reignited by renewed interest in the Data Mesh concept, whose adherents claim it is the definitive architecture for the 21st century, one that finally addresses decades of data management failures. Before we go any further, let's review the hub, fabric and mesh concepts and then draw our conclusions.
A data hub is an architecture and strategy for data management, not a singular product. Think of it as a central data repository with spokes that radiate out to systems and consumers. A data hub architecture thus enables data sharing by connecting producers of data with consumers of data.
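To make the hub-and-spoke idea concrete, here is a minimal sketch of a hub mediating between producers and consumers. Everything here (the class, topics, records) is invented for illustration; real data hubs are platforms, not a single class, but the sketch shows why a hub beats point-to-point integration: each party integrates with the hub once.

```python
# Hypothetical hub-and-spoke sketch; all names are illustrative,
# not drawn from any specific product.
class DataHub:
    """Central hub connecting data producers to data consumers."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of consumer callbacks

    def subscribe(self, topic, consumer):
        # A consumer registers with the hub once, instead of
        # integrating point-to-point with every producer.
        self._subscribers.setdefault(topic, []).append(consumer)

    def publish(self, topic, record):
        # A producer hands a record to the hub; the hub fans it out.
        for consumer in self._subscribers.get(topic, []):
            consumer(record)


received = []
hub = DataHub()
hub.subscribe("orders", received.append)   # an analytics consumer
hub.subscribe("orders", lambda r: None)    # a second consumer, same topic
hub.publish("orders", {"id": 1, "amount": 42})
print(received)  # [{'id': 1, 'amount': 42}]
```

Adding a new consumer is one `subscribe` call; no producer has to change, which is the essence of the hub's value over point-to-point wiring.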
Figure 1. A Data Hub
Gartner Research first published a paper on the topic in 2017, recommending a technology-agnostic architecture for connecting data producers with consumers in place of point-to-point alternatives. Gartner further refined and evolved the concept in subsequent research and currently defines the attributes of a data hub as follows:
Figure 2. Data Hub Attributes
Models describe how the data stored in the hub is structured and consumed. Governance defines data privacy, access, control, security, retention and disposal policies, as well as the owners of those policies. Integration defines the style and method of working with the data in the hub (e.g., API, ETL). Persistence defines the category of data store (e.g., relational database).
Gartner later hypothesized that there were many uses for specialized, purpose-built data hubs. These were detailed as follows:
- Analytics data hub: For collection and sharing of data for downstream analytics processes.
- Application data hub: Provides a domain context for a specific application or suite.
- Integration data hub: Focuses on sharing data using various integration styles.
- Master data hub: Focuses on sharing master data across enterprise operational systems and processes.
- Reference data hub: Similar in purpose to application/master data hubs, but with a narrower scope of “reference data” (e.g., commonly used codes).
Finally, Gartner’s revised notion of a data hub specified that a company could concurrently operate several hubs in a distributed manner, as shown in Figure 3.
Figure 3. Gartner's Specialized Data Hubs
Noel Yuhanna of Forrester Research first coined the notion of a “Big Data Fabric” in late 2016 with the publication of “Forrester Wave: Big Data Fabric.” He described a technology-oriented approach that automatically, intelligently and securely combined disparate data sources, then processed them in a big data platform such as Hadoop or Apache Spark. An organization could then present unified data from disparate sources to downstream consumers. The goals of the data fabric are twofold:
1. Increase agility by creating a semantic layer that can be automated to accelerate the delivery of data and insights.
2. Minimize complexity by creating automated pipelines that streamline data ingestion, integration and curation.
Noel’s Big Data Fabric architecture consisted of five components:
- Data Ingestion – Data ingestion imports disparate data into the fabric where it is stored, analyzed and accessed.
- Data Management – A common set of data management capabilities, such as data lineage, data quality etc. that are used throughout the fabric.
- Data Orchestration – Automates the development and management of end-to-end data workflows or pipelines, such as bringing data together from multiple sources, combining it, and preparing it for analysis.
- Data Discovery – Involves data collection and evaluation to understand trends, patterns and relationships between elements.
- Data Access – Facilitates governed role-based access control to the data via a self-service user interface or programmatic API.
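The five components above can be pictured as stages of a single flow, from raw sources to governed consumption. The following toy sketch makes that reading concrete; the function names, quality rule, and role check are all invented for illustration and are not part of Forrester's definition.

```python
# Illustrative sketch of the five fabric components as pipeline stages.

def ingest(sources):
    # Data Ingestion: import disparate records into the fabric.
    return [row for src in sources for row in src]

def manage(rows):
    # Data Management: apply a simple quality rule and tag lineage.
    return [dict(row, lineage="fabric") for row in rows if row.get("id") is not None]

def orchestrate(rows):
    # Data Orchestration: combine and prepare data for analysis.
    return sorted(rows, key=lambda r: r["id"])

def discover(rows):
    # Data Discovery: derive a simple summary from the curated data.
    return {"count": len(rows)}

def access(rows, role):
    # Data Access: governed, role-based access to the curated data.
    if role != "analyst":
        raise PermissionError("role not authorized")
    return rows


crm = [{"id": 2, "name": "Ada"}]
erp = [{"id": 1, "name": "Grace"}, {"id": None, "name": "bad row"}]
curated = orchestrate(manage(ingest([crm, erp])))
print(discover(curated))  # {'count': 2}
```

A real fabric replaces each toy function with heavyweight machinery (Spark jobs, catalogs, policy engines), but the shape of the flow is the same.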
Noel continued to evolve the Big Data Fabric concept in the intervening years. His current data fabric thinking focuses on addressing broader business use cases, such as a 360-degree view of the customer, customer intelligence, and internet of things analytics. It includes components such as AI/ML, data catalog, data transformation, data preparation, data discovery, data governance and data modeling, which together support end-to-end data management capabilities.
Figure 4. Forrester's Data Fabric
Gartner also embraced the data fabric moniker, defining the concept as follows:
“A data fabric is an emerging data management and data integration design concept for attaining flexible, reusable and augmented data integration pipelines, services and semantics, in support of various operational and analytics use cases delivered across multiple deployment and orchestration platforms. Data fabrics support a combination of different data integration styles and utilize active metadata, knowledge graphs, semantics and ML to augment data integration design and delivery.”
More specifically, Gartner defines five technological attributes as part of the data fabric. They are as follows:
Figure 5. Data Fabric Attributes
- Active Metadata – Traditionally, metadata cataloged passive elements such as schema definitions, field types and data values. Gartner’s expanded definition adds operational elements, such as usage and access patterns, and defines active metadata as the combination of these metadata types with knowledge graph relationships.
- Knowledge Graphs – Store and visualize the complex relationships between multiple data entities. Knowledge graphs also help non-technical users interpret and maintain data taxonomies and ontologies.
- AI/ML Capabilities – Automatically assist, enhance or perform various data management activities up and down the stack.
- Dynamic Data Integration – Forrester’s original definition described different styles and modes of data ingestion. Gartner’s definition envisions a more adaptive and dynamic activity, in which data integration and data delivery optimizations are proposed based on active metadata and AI/ML recommendations.
- Automated Data Orchestration – Integrates, transforms and delivers data to support various data and analytics use cases across the enterprise. Additionally, automated data orchestration enables users to leverage DataOps principles across the entire process for agile, repeatable and reliable data pipelines.
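Two of these pillars, knowledge graphs and active metadata, can be illustrated together: a graph stores relationships between data entities, and usage counters on its edges act as a crude stand-in for active metadata. This is an invented sketch, not Gartner's or any vendor's implementation; the entity names and the `access_count` signal are assumptions for the example.

```python
# Toy knowledge graph over data entities, with usage metadata on edges
# standing in for "active metadata"; everything here is illustrative.
class KnowledgeGraph:
    def __init__(self):
        self.edges = []  # (subject, relation, object, metadata)

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj, {"access_count": 0}))

    def neighbors(self, subject):
        # Traverse relationships, updating usage metadata as we go.
        found = []
        for s, rel, o, meta in self.edges:
            if s == subject:
                meta["access_count"] += 1
                found.append((rel, o))
        return found

    def hot_edges(self, threshold=1):
        # Surface frequently used relationships: the kind of signal an
        # ML-driven fabric could use to optimize integration and delivery.
        return [(s, r, o) for s, r, o, m in self.edges
                if m["access_count"] >= threshold]


kg = KnowledgeGraph()
kg.relate("customer", "places", "order")
kg.relate("order", "contains", "product")
kg.neighbors("customer")
print(kg.hot_edges())  # [('customer', 'places', 'order')]
```

The point of the sketch is the feedback loop: the graph is not just documentation, its usage statistics become input for automated integration decisions.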
Figure 6. Gartner's Data Fabric Pillars
Once again, a data fabric architecture is vendor and technology agnostic. It doesn’t describe topologies, storage mechanisms, workflows, or formats. However, today’s enterprise data fabric solutions are a set of capabilities that minimize integration complexity by automating processes, workflows and pipelines, generating code, and streamlining data access, all to accelerate time to market for multiple data consumption scenarios.
A passionate argument is raging throughout the data architect community over whether a new data management paradigm is required for analytics in the 21st century. I’ve documented two approaches: the data hub and the data fabric. Which do you prefer?