There are many ways in which the speed of data transfer directly feeds the bottom line. At the most basic level, organizations need to do two things with data:
- Record what’s happening to their business for sales, expenditures, hiring, etc.
- Analyze what’s happening in order to make informed decisions like which customers to target or which costs to cut by “querying” their records.
Typically, the same database cannot support both actions because the underlying server has only so much CPU processing power available. It is not acceptable for an analytics query to slow down production workloads such as the processing of online sales transactions. Hence the need to analyze production records on a different platform.
The traditional method of replicating production data to an analytics platform is full loading, also known as bulk or batch loading. This process creates files or tables at the target, defines their “metadata” structures based on the source and fully populates those target tables or files with data copied from the source. Because full loads take time and often consume significant processing power on the source, administrators usually need to pause production operations during planned “batch windows” that take hours or longer. Batch windows are disruptive and increasingly unacceptable in today’s global, 24×7 business environment.
Change data capture (CDC) is different. CDC replicates only the most recent production data and metadata changes that the source has registered during a given time period — typically seconds or minutes. CDC can accompany batch loads to ensure targets are in sync upon load completion. It also can become the primary synchronization method thereafter. Like full load, CDC can copy data from one source to one target, or one source to multiple targets.
Advantages of Change Data Capture (CDC)
Change Data Capture CDC has three primary advantages:
- It enables faster and more accurate decisions based on the most current data.
- It reduces the need for, or even replaces, batch loading.
- It enables zero-downtime migrations.
The value of the first benefit should by now be self-evident.
The second benefit is critical because it makes replication more feasible for a variety of use cases. An analytics team might be willing to wait for the next nightly batch load to run their queries (although that’s increasingly
less common). But even then, they cannot stop their 24×7 production databases for a bulk/batch load. When they are able to kill the batch window with CDC technology, they can keep production running full time.
The third benefit is also self-evident. CDC is a powerful complement to batch loads in the case of data migrations. By capturing and applying ongoing updates, CDC can ensure that the target is and remains fully current upon completion of a batch load. The analytics team can switch from source to target without any pause in production operations.
Methods of Capturing Data
There are several ways in which CDC software identifies and copies changes.
- Timestamps in a dedicated table column can record the time of the last update, thereby flagging any row containing data more recent than the last CDC replication task. IBM InfoSphere CDC uses this method. While timestamps are a straightforward method of tracking updates, CDC can slow operations by frequently querying timestamp columns within production tables. They also can create complexity as administrators need to ensure time zones are accurately represented.
- Triggers log transaction events in an additional “shadow” table that can be “played back” to copy those events to the target on a regular basis. Oracle GoldenGate CDC uses this method when replicating data from SQL Server 2016. While triggers enable the necessary updates from source to target, the creation of these additional tables can increase processing overhead and slow operations.
- Log Readers, as the name implies, identify new transactions by scanning change logs, usually from a remote server. This method is often the fastest and least disruptive of the CDC options because it requires no new table and does not query production operations. Attunity Replicate uses this approach.
- Query-based CDC also is somewhat self-explanatory. Some source databases and data warehouses, such as Teradata, do not have change logs, and therefore require CDC to regularly query the production database for changes. This method can involve querying timestamp columns as described above.
CDC Integration with Hadoop Data Lakes
Data lakes have emerged as a primary platform on which you can cost-effectively store and process a wide variety of data types. Approximately one quarter of organizations are already using a data lake, and another quarter expects to do so within twelve months (source: “Data Lakes: Purposes, Practices, Patterns and Platforms” TDWI, 2017). Hadoop Data lakes support myriad analytics use cases, including fraud detection, real-time customer offers, market trend/pricing analysis, social media monitoring and more. And most of these use cases require real-time data.
With the Apache open source ecosystem, whose primary components are available in distribution packages from Hortonworks, Cloudera and MapR, the historical batch processing engine MapReduce is increasingly being complemented or replaced by real-time engines such as Spark and Tez. These engines run on data stored in repositories such as the Hadoop File System (HDFS) and Hive SQL-like data stores. They also can run on Amazon S3 within the Amazon EMR Hadoop distribution.
Full load and CDC replication software can access HDFS through the high-performance native protocol WebHDFS, which in turn uses the Kerberos open source network protocol to authenticate users. Once connected and authenticated, data architects and other users can use either full load or CDC to land source data in change tables in HDFS. The data can be queried there, using MapReduce batch processing for historical analyses.
CDC becomes critical when real-time engines like Spark come into play. For example, the Spark in-memory processing framework can apply machine learning or other highly iterative workloads to data in HDFS. Alternatively, users can transform and merged data from HDFS into Hive data stores with familiar SQL structures that derive from traditional data warehouses. Here again, rapid and real-time analytics requires the very latest data, which requires CDC.
Here is a maturity model for the Hadoop ecosystem:
This table gives the overview of how Sqoop compares to Attunity’s solutions.