In this technical post, let's take a look under the hood of Qlik Replicate (formerly Attunity Replicate), the foundation of our complete data integration platform. Specifically, we examine Qlik Replicate's support for both Full Load (a.k.a. batch) replication and change data capture (CDC).
Qlik Replicate is web based and typically acts as a middle-tier server spanning three domains: the sources, the replication server and the targets. Data sources and targets can be on-premises or in the cloud. DBAs and data engineers interact with these domains through an intuitive web-based interface with Console and Designer views. This web-based architecture carries through the rest of the data integration platform, which unlocks information assets with scalable, easy-to-use solutions for building high-performance data pipelines.
Support for Full Load and CDC Replication
With Qlik Replicate, the Full Load process works as follows:
Unlike the CDC process, Full Load moves data into one or more tables or files at a time, which is more efficient for bulk transfer. Although the source tables may be subject to update activity during the Full Load process, there is no need to stop applications on the source. To guarantee the consistency of the data, the CDC process is automatically activated as soon as the load starts.
However, changes are not applied at the target until table loading completes. Although the data on the target may not be consistent while the load is active, its consistency and integrity at the conclusion of the load are guaranteed. In addition, the load process can be interrupted; when it restarts, it continues from wherever it was stopped. New tables can be added to an existing target without reloading the existing tables, and columns can be added to or dropped from previously populated target tables without the need to reload.
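The load-then-apply sequence above can be sketched in a few lines. This is a conceptual illustration only, not Qlik Replicate's actual implementation; the `Target` class and `full_load` function are hypothetical names:

```python
# Conceptual sketch (hypothetical names, not Qlik Replicate code):
# bulk-copy a table snapshot, then replay the changes that were
# cached while the load was running.

class Target:
    def __init__(self):
        self.rows = {}

    def insert(self, key, value):
        self.rows[key] = value

    def apply(self, change):
        op, key, value = change
        if op == "delete":
            self.rows.pop(key, None)
        else:                          # insert or update
            self.rows[key] = value

def full_load(snapshot, cached_changes, target):
    # Step 1: bulk-copy the snapshot taken when the load started.
    for key, value in snapshot.items():
        target.insert(key, value)
    # Step 2: only now apply, in commit order, the changes cached
    # during the load -- this is what guarantees target consistency
    # at the conclusion of the load.
    for change in cached_changes:
        target.apply(change)
    return target
```

The key point the sketch captures is the ordering: cached changes are held back until the snapshot copy has finished.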
The CDC process obtains a stream of filtered events, or changes in data or metadata, from the source's transaction and archive log files. One of its most important functions is to buffer all the changes for a given transaction into a single unit, which is forwarded to the target only when the transaction commits. As noted above, during the initial load CDC also buffers all the changes that occur within a transaction until all affected tables have finished loading. If changes cannot be applied to the target database in a reasonable timeframe, they are buffered on the replication server for as long as necessary. This avoids re-reading the source database logs, which could take a significant amount of time.
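The buffer-until-commit behaviour can be sketched as follows. This is an illustration of the general CDC pattern, not Qlik Replicate's code; all class and method names are hypothetical:

```python
# Sketch of per-transaction buffering (hypothetical names): changes are
# held per transaction id and forwarded to the target as a single unit
# only when the COMMIT is seen; rollbacks never reach the target.

from collections import defaultdict

class TransactionBuffer:
    def __init__(self, send_to_target):
        self.pending = defaultdict(list)   # txn id -> buffered changes
        self.send = send_to_target

    def on_change(self, txn_id, change):
        # Buffer the change; nothing is sent yet.
        self.pending[txn_id].append(change)

    def on_commit(self, txn_id):
        # Forward the whole transaction as one unit, then drop the buffer.
        self.send(self.pending.pop(txn_id, []))

    def on_rollback(self, txn_id):
        # A rolled-back transaction is simply discarded.
        self.pending.pop(txn_id, None)
```

Interleaved transactions read from the log are thus untangled on the replication server, and the target only ever sees complete, committed units of work.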
Two processes are essential for both Full Load and CDC: Filter/Compress and Transformation.
Qlik Replicate applies these lightweight filters and transformations in memory. More complex transformations can be applied further downstream with the rest of the Qlik Data Integration platform: Qlik Compose automates the creation of data warehouses, data marts and data lakes (Data Warehouse Automation and Data Lake Creation), while Qlik Catalog provides the Enterprise Data Catalog.
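The in-memory filter-then-transform idea looks roughly like this. The predicate and transform below are invented examples, not Qlik Replicate configuration syntax:

```python
# Minimal sketch of in-memory filtering and lightweight transformation
# applied to a change stream; the filter and transform shown are
# hypothetical examples.

def replicate_stream(changes, keep, transform):
    """Yield transformed changes that pass the filter, one at a time."""
    for change in changes:
        if keep(change):
            yield transform(change)

changes = [
    {"table": "orders", "amount": 120},
    {"table": "audit",  "amount": 5},
    {"table": "orders", "amount": 80},
]

# Example rule: keep only the 'orders' table, upper-case the table name.
result = list(replicate_stream(
    changes,
    keep=lambda c: c["table"] == "orders",
    transform=lambda c: {**c, "table": c["table"].upper()},
))
```

Because the stream is processed row by row in memory, nothing is staged to disk for these lightweight operations.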
There is a lot of flexibility in the data flows that can be configured with Qlik Replicate, such as replicating data from one source to many targets (fan-out), from many sources to one target (fan-in), or moving data from multiple disparate systems into the cloud.
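The fan-out topology reduces to a simple dispatch loop, sketched below with hypothetical names (fan-in is the mirror image: many change streams merged into one target):

```python
# Sketch of fan-out: every change from one source stream is delivered
# to each configured target. Names are illustrative only.

def fan_out(changes, targets):
    for change in changes:
        for target in targets:
            target.append(change)   # each target receives every change

cloud_target, warehouse_target = [], []
fan_out(["insert:1", "update:1"], [cloud_target, warehouse_target])
```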
A quick note on the persistent store: it is not a permanent store for the data being replicated; it holds the configuration and state of a task.
Each replication source-and-target pairing is configured as a single task, and it is this configuration that is stored (metadata, selected fields and tables, and the required transformations). The state of the last replication run is also stored, so Qlik Replicate can pick up where it left off after interruptions, errors or pauses in the replication tasks. Think bookmarks.
And all of this can be done at scale. We have many customers with hundreds of tasks running in their production systems, automating their data integration requirements.
In a separate post, I'll detail Qlik Replicate's zero-footprint architecture and scalability.
If you would like to try Qlik Replicate for yourself, you can take a Test Drive in a controlled sandbox environment and see how easy it is to start data replication or get in touch to discuss your requirements.