Over the course of my next two blog posts, I would like to share my thoughts around a debate raging in data architecture circles. The bone of contention? That the 21st century needs a new data management paradigm for modern analytics. First up, I’ll frame the argument and explain the two prominent approaches of data hub and data fabric. Then, I’ll cover data mesh and compare all three architectures. As always, I’d love to get your input, feedback, queries and comments!
Are They Brand New Paradigms or Just Reworked Data Strategies?
One of the most animated arguments among data architects in recent months concerns the merits of different data architectures. The debate has been reignited by renewed interest in the data mesh concept. Adherents claim that data mesh is the definitive architecture for the 21st century, one that addresses decades of data management failures. Before we go any further, however, let’s first review the hub, fabric and mesh concepts and then draw our conclusions.
A data hub is an architecture and strategy for data management, not just a singular product. Think of it as a central data repository, with spokes that radiate to systems and customers. Therefore, a data hub architecture simply enables data sharing by connecting producers of data with consumers of data.
Figure 1. A Data Hub
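To make the hub-and-spoke idea concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the DataHub class, the topic names and the two link-counting helpers are all hypothetical names chosen for illustration. It shows producers publishing to a central hub that forwards records to subscribed consumers, and it compares the number of connections a hub needs with wiring every producer directly to every consumer.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# A consumer is modeled as any callable that accepts one record (assumption).
Handler = Callable[[Any], None]


class DataHub:
    """Toy in-memory hub: producers publish to a topic, consumers subscribe."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a consumer for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: Any) -> int:
        """Deliver a record to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(record)
        return len(handlers)


def point_to_point_links(producers: int, consumers: int) -> int:
    """Connections needed if every producer feeds every consumer directly."""
    return producers * consumers


def hub_links(producers: int, consumers: int) -> int:
    """Connections needed if each system connects only to the hub."""
    return producers + consumers


hub = DataHub()
received: List[Any] = []
hub.subscribe("orders", received.append)   # consumer 1 keeps the records
hub.subscribe("orders", lambda r: None)    # consumer 2 ignores them
delivered = hub.publish("orders", {"id": 1})
```

With 5 producers and 8 consumers, the direct approach needs 40 connections while the hub needs 13, which is the practical appeal of the spoke model, assuming every producer really does feed every consumer.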
Gartner Research first published a paper on the topic in 2017. The publication recommended a technology-agnostic architecture for connecting data producers with consumers as an alternative to point-to-point integration. They further refined and evolved the concept in subsequent research and currently define the attributes of a data hub as follows:
Figure 2. Data Hub Attributes
Models describe how the data stored in the hub is structured and consumed. Governance defines data privacy, access, control, security, retention and disposal policies, as well as the owners of those policies. Integration defines the style and method of working with the data in the hub (e.g., API, ETL, etc.). Persistence defines the category of data store (e.g., relational database).
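One way to picture these four attributes is as a small specification object. The sketch below is only an illustration under my own assumptions: HubSpec, GovernancePolicy, the field names and the sample values are hypothetical and do not come from Gartner's research.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class GovernancePolicy:
    """Who owns the policy, how long data is kept, and who may read it."""
    owner: str
    retention_days: int
    allowed_roles: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class HubSpec:
    """The four hub attributes gathered into one description."""
    model: str                    # how the data is structured and consumed
    governance: GovernancePolicy  # privacy, access, retention, ownership
    integration: List[str]        # styles of working with the data, e.g. API, ETL
    persistence: str              # category of data store

    def can_access(self, role: str) -> bool:
        """Check a role against the governance policy."""
        return role in self.governance.allowed_roles


# Hypothetical example values, for illustration only.
customer_hub = HubSpec(
    model="canonical customer schema",
    governance=GovernancePolicy(
        owner="data-office",
        retention_days=365,
        allowed_roles=["analyst"],
    ),
    integration=["API", "ETL"],
    persistence="relational database",
)
```

Keeping governance as its own object mirrors the point that policies have owners of their own, separate from whoever operates the hub.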
Gartner later hypothesized that there were many uses for specialized, purpose-built data hubs. These were detailed as follows:
Finally, Gartner’s revised notion of a data hub specified that a company could concurrently operate several hubs in a distributed manner as shown below:
Figure 3. Gartner's Specialized Data Hubs
Noel Yuhanna of Forrester Research coined the notion of a “Big Data Fabric” in late 2016 with the publication of “Forrester Wave: Big Data Fabric.” He described a technology-oriented approach that combined disparate data sources automatically, intelligently, and securely, and processed them in a big data platform using Hadoop and Apache Spark. An organization could then present unified data from disparate data sources to downstream consumers. The goals of the data fabric are twofold:
1. Increase agility by creating a semantic layer that can be automated to accelerate the delivery of data and insights.
2. Minimize complexity by creating automated pipelines that streamline data ingestion, integration and curation.
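To illustrate what a semantic layer over disparate sources might look like, here is a deliberately tiny Python sketch. It is my own simplification, not Forrester's architecture: the source names ("crm", "billing"), the field mappings and both functions are hypothetical, and a real fabric would run on a big data platform rather than in-memory dictionaries.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]

# Hypothetical semantic layer: maps each source's field names to shared names.
SEMANTIC_LAYER: Dict[str, Dict[str, str]] = {
    "crm": {"cust_id": "customer_id", "full_name": "name"},
    "billing": {"customerId": "customer_id", "amount_due": "balance"},
}


def to_unified(source: str, record: Record) -> Record:
    """Rename mapped fields to shared names and drop unmapped ones."""
    mapping = SEMANTIC_LAYER[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def unified_view(sources: Dict[str, Iterable[Record]]) -> Dict[Any, Record]:
    """Merge records from all sources into one record per customer_id.

    Assumes every record carries its source's customer identifier;
    a record without one raises KeyError.
    """
    merged: Dict[Any, Record] = {}
    for source, records in sources.items():
        for record in records:
            unified = to_unified(source, record)
            merged.setdefault(unified["customer_id"], {}).update(unified)
    return merged


view = unified_view({
    "crm": [{"cust_id": 7, "full_name": "Ada", "internal_flag": True}],
    "billing": [{"customerId": 7, "amount_due": 12.5}],
})
```

Downstream consumers read the merged view and never need to know that one system says cust_id and another says customerId, which is the agility the first goal is after.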
Noel’s Big Data Fabric architecture consisted of five components:
Noel continued to evolve the Big Data Fabric concept in the intervening years. His current data fabric thinking focuses on addressing broader business use cases, such as a 360-degree view of the customer, customer intelligence, and internet of things analytics. It includes components such as AI/ML, data catalog, data transformation, data preparation, data discovery, data governance and data modeling, which together support end-to-end data management capabilities.
Figure 4. Forrester's Data Fabric
Gartner similarly embraced the data fabric moniker and defined the concept as follows:
“A data fabric is an emerging data management and data integration design concept for attaining flexible, reusable and augmented data integration pipelines, services and semantics, in support of various operational and analytics use cases delivered across multiple deployment and orchestration platforms. Data fabrics support a combination of different data integration styles and utilize active metadata, knowledge graphs, semantics and ML to augment data integration design and delivery.”
More specifically, Gartner defines five technological attributes as part of the data fabric. They are as follows:
Figure 5. Data Fabric Attributes
Figure 6. Gartner's Data Fabric Pillars
Once again, a data fabric architecture is vendor- and technology-agnostic. It doesn’t describe topologies, storage mechanisms, workflows, or formats. However, today’s enterprise data fabric solutions are sets of capabilities that minimize integration complexity by automating processes, workflows and pipelines, generating code, and streamlining data access, all of which accelerates time to market for multiple data consumption scenarios.
A passionate argument is raging throughout the data architect community over whether a new data management paradigm is required for analytics in the 21st century. I’ve documented two approaches: data hub and data fabric. Which do you prefer?