Why Change Data Capture (CDC)? The Business Case
There are many ways in which the speed of data transfer directly feeds the bottom line. At the most basic level, organizations need to do two things with data: record the transactions and operations that run the business, and analyze that data to guide decisions.
Typically, the same database cannot support both actions because the underlying server has only so much CPU processing power available. It is not acceptable for an analytics query to slow down production workloads such as the processing of online sales transactions. Hence the need to analyze production records on a different platform.
The traditional method of replicating production data to an analytics platform is full loading, also known as bulk or batch loading. This process creates files or tables at the target, defines their “metadata” structures based on the source and fully populates those target tables or files with data copied from the source. Because full loads take time and often consume significant processing power on the source, administrators usually need to pause production operations during planned “batch windows” that take hours or longer. Batch windows are disruptive and increasingly unacceptable in today’s global, 24×7 business environment.
Change data capture (CDC) is different. CDC replicates only the most recent production data and metadata changes that the source has registered during a given time period — typically seconds or minutes. CDC can accompany batch loads to ensure targets are in sync upon load completion. It also can become the primary synchronization method thereafter. Like full load, CDC can copy data from one source to one target, or one source to multiple targets.
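To make the contrast concrete, here is a minimal Python sketch of both approaches against a relational source and target. The "orders" table, its "last_updated" column, and the SQLite-style SQL are illustrative assumptions, not details of any particular product.

    # `source` and `target` are DB-API connections (e.g. sqlite3.Connection);
    # the "orders" table and its "last_updated" column are illustrative.

    def full_load(source, target):
        # Copy every row; typically run once, inside a planned batch window.
        rows = source.execute("SELECT id, status, last_updated FROM orders").fetchall()
        target.execute("DELETE FROM orders")            # rebuild the target from scratch
        target.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        target.commit()

    def apply_changes(source, target, since):
        # Copy only rows changed after `since`; run every few seconds or minutes.
        changed = source.execute(
            "SELECT id, status, last_updated FROM orders WHERE last_updated > ?",
            (since,),
        ).fetchall()
        target.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", changed)
        target.commit()
        # Return the new checkpoint: the newest change timestamp seen so far.
        return max((row[2] for row in changed), default=since)

CDC products capture changes far more efficiently than this polling loop implies (most read the database transaction log, as described below), but the division of labor is the same: one bulk copy, then a continuous stream of small deltas.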
Advantages of Change Data Capture (CDC)
Change data capture (CDC) has three primary advantages: it gives analytics platforms the most current data; it eliminates the disruptive batch windows that full loads impose on production systems; and, because it moves only incremental changes, it efficiently keeps targets current after an initial full load.
The value of the first benefit should by now be self-evident.
The second benefit is critical because it makes replication feasible for a wider variety of use cases. An analytics team might be willing to wait for the next nightly batch load to run their queries (although that is increasingly uncommon). But even then, they cannot stop their 24×7 production databases for a bulk/batch load. By eliminating the batch window with CDC technology, they can keep production running full time.
The third benefit is also self-evident. CDC is a powerful complement to batch loads during data migrations. By capturing and applying ongoing updates, CDC ensures that the target is fully current when the batch load completes and stays current thereafter. The analytics team can switch from source to target without any pause in production operations.
Methods of Capturing Data
There are several ways in which CDC software identifies and copies changes. The most common are log-based capture, which reads committed changes from the database's transaction or redo log; trigger-based capture, which uses triggers on source tables to record each change in a shadow table; and query-based capture, which periodically queries for rows whose timestamp or version column indicates a change.
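As one illustration, the following is a minimal sketch of trigger-based capture, written in Python against SQLite purely for brevity; the "orders" source table and "orders_changes" shadow table are hypothetical. Commercial CDC tools generally favor log-based capture because it avoids adding trigger overhead to production transactions.

    import sqlite3

    # Minimal trigger-based capture sketch; table names are illustrative.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE orders_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, id INTEGER, status TEXT
    );
    -- Every insert, update, or delete on the source table is recorded as a change row.
    CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
        INSERT INTO orders_changes (op, id, status) VALUES ('I', NEW.id, NEW.status);
    END;
    CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_changes (op, id, status) VALUES ('U', NEW.id, NEW.status);
    END;
    CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
        INSERT INTO orders_changes (op, id, status) VALUES ('D', OLD.id, OLD.status);
    END;
    """)

    conn.execute("INSERT INTO orders VALUES (42, 'new')")
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 42")

    # A replication process would periodically read these rows, apply them to the
    # target, and then delete or mark them as delivered.
    print(conn.execute("SELECT op, id, status FROM orders_changes").fetchall())
    # -> [('I', 42, 'new'), ('U', 42, 'shipped')]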
CDC Integration with Hadoop Data Lakes
Data lakes have emerged as a primary platform on which you can cost-effectively store and process a wide variety of data types. Approximately one quarter of organizations are already using a data lake, and another quarter expect to do so within twelve months (source: “Data Lakes: Purposes, Practices, Patterns and Platforms,” TDWI, 2017). Hadoop data lakes support myriad analytics use cases, including fraud detection, real-time customer offers, market trend/pricing analysis, social media monitoring and more. And most of these use cases require real-time data.
With the Apache open source ecosystem, whose primary components are available in distribution packages from Hortonworks, Cloudera and MapR, the historical batch processing engine MapReduce is increasingly being complemented or replaced by real-time engines such as Spark and Tez. These engines run on data stored in the Hadoop Distributed File System (HDFS) and in SQL-like stores such as Hive. They can also run on data in Amazon S3 within the Amazon EMR Hadoop distribution.
Full load and CDC replication software can access HDFS through the high-performance WebHDFS interface built into HDFS, which uses the Kerberos network authentication protocol to authenticate users. Once connected and authenticated, data architects and other users can use either full load or CDC to land source data in change tables in HDFS. The data can be queried there, using MapReduce batch processing for historical analyses.
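As a rough illustration of the landing step, the following Python sketch writes a file of change records to HDFS over the WebHDFS REST interface. The host name, port (9870 is the Hadoop 3 default), path, and file contents are hypothetical, and the sketch assumes an unsecured cluster; a Kerberized cluster would additionally require SPNEGO/Kerberos authentication on each request.

    import requests

    # Hypothetical NameNode endpoint and target path.
    NAMENODE = "http://namenode.example.com:9870"
    HDFS_PATH = "/landing/orders/changes_0001.csv"

    def webhdfs_create(path, data):
        # Step 1: ask the NameNode where to write. It answers with a 307 redirect
        # whose Location header points at a DataNode.
        url = f"{NAMENODE}/webhdfs/v1{path}?op=CREATE&overwrite=true"
        resp = requests.put(url, allow_redirects=False)
        resp.raise_for_status()
        datanode_url = resp.headers["Location"]

        # Step 2: send the actual bytes to the DataNode.
        resp = requests.put(datanode_url, data=data)
        resp.raise_for_status()

    webhdfs_create(HDFS_PATH, b"op,id,status,change_ts\nU,42,shipped,2017-06-01T12:00:00\n")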
CDC becomes critical when real-time engines like Spark come into play. For example, the Spark in-memory processing framework can apply machine learning or other highly iterative workloads to data in HDFS. Alternatively, users can transform and merge data from HDFS into Hive data stores with familiar SQL structures that derive from traditional data warehouses. Here again, rapid, real-time analytics requires the very latest data, which in turn requires CDC.
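The following PySpark sketch shows one simplified way such a merge might look: it reduces landed change records in HDFS to the current state of each row and publishes the result as a Hive table. The paths, column names ("id", "op", "change_ts") and the "analytics.orders_current" table are assumptions for illustration; it also assumes the landed files include the rows from the initial full load, whereas a production pipeline would more likely use an ACID merge or a dedicated replication tool.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = (SparkSession.builder
             .appName("apply-cdc-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Read the change records that full load and CDC landed in HDFS.
    changes = (spark.read
               .option("header", "true")
               .csv("hdfs:///landing/orders/changes/*.csv"))

    # Keep only the most recent change per key, so each row reflects its latest state.
    latest_per_key = Window.partitionBy("id").orderBy(F.col("change_ts").desc())
    latest = (changes
              .withColumn("rn", F.row_number().over(latest_per_key))
              .filter("rn = 1")
              .drop("rn"))

    # Drop deleted rows and publish the current state as a Hive table for SQL analytics.
    current = latest.filter(F.col("op") != "D").drop("op", "change_ts")
    current.write.mode("overwrite").saveAsTable("analytics.orders_current")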
Here is a maturity model for the Hadoop ecosystem:
This table gives an overview of how Sqoop compares to Attunity’s solutions.