We all know that cloud architectures have redefined IT. Elastic resource availability, usage-based pricing, and a reduced in-house IT management burden have proven compelling for most modern enterprises. Odds are that your organization is getting increasingly serious about running operations and analytics workloads in the public cloud, most notably Amazon Web Services (AWS), Azure, or Google Cloud Platform. Most data infrastructure, including data warehouses, data lakes, and streaming systems, can run in the cloud, where it needs the same data integration and processing steps outlined earlier in Chapter 3.
As you adopt cloud platforms, data movement becomes more critical than ever, because the cloud is not a monolithic, permanent destination. Odds are also that your organization has no plans to abandon its on-premises datacenters wholesale anytime soon, so hybrid cloud architectures will remain the rule. In addition, as organizations’ knowledge and requirements evolve, IT will need to move workloads frequently between clouds and even back on-premises.
Change Data Capture (CDC) makes the necessary cloud data transfer more efficient and cost-effective. You can continuously synchronize on-premises and cloud data repositories without repeated, disruptive batch jobs. When implemented correctly, CDC consumes less bandwidth than batch loads, helping organizations reduce the need for expensive transmission lines. Transmitting updates over a wide area network (WAN) does raise new performance, reliability, and security requirements. Typical methods to address these requirements include multipathing, intermediate data staging, and encryption of data in flight.
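To make the bandwidth point concrete, here is a minimal sketch of shipping only deltas across the WAN rather than re-sending whole tables. It uses a query-based, timestamp-watermark approximation of CDC against an in-memory SQLite table; production CDC tools typically read the database transaction log instead, and upload_over_wan is a hypothetical stand-in for an encrypted (TLS) transfer to the cloud target.

    import sqlite3

    def upload_over_wan(rows):
        # Hypothetical stand-in for an encrypted transfer to the cloud target.
        print(f"shipping {len(rows)} row(s) across the WAN")

    def sync_changes(conn, watermark):
        # Query-based approximation of CDC: pull only rows modified since the
        # last watermark, so full-table copies never cross the WAN again.
        rows = conn.execute(
            "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
            (watermark,),
        ).fetchall()
        if rows:
            upload_over_wan(rows)
        # Advance the watermark to the newest change that has been shipped.
        return max((r[2] for r in rows), default=watermark)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "new", "2024-01-01T00:00:00Z"), (2, "shipped", "2024-01-02T00:00:00Z")],
    )
    watermark = "1970-01-01T00:00:00Z"
    watermark = sync_changes(conn, watermark)   # first pass ships both rows
    conn.execute(
        "UPDATE orders SET status = 'delivered', "
        "updated_at = '2024-01-03T00:00:00Z' WHERE id = 1"
    )
    watermark = sync_changes(conn, watermark)   # second pass ships only the changed row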
A primary CDC use case for cloud architectures is zero-downtime migration or replication. Here, CDC is a powerful complement to an initial batch load: by capturing and applying ongoing updates, administrators can keep production systems running throughout the migration, keep source and target synchronized without repeated full loads, and hold WAN bandwidth costs down.
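To illustrate that pattern (though not any particular tool's API), the following sketch pairs a one-time bulk load with a loop that drains captured change events until the target catches up; bulk_copy, change_feed, and apply_change are hypothetical placeholders.

    from queue import Empty, Queue

    def bulk_copy(source, target):
        # Phase 1: one-time snapshot load; the source stays online throughout.
        target.update(source)

    def apply_change(target, change):
        # Phase 2: apply one captured change event (insert/update/delete).
        op, key, value = change
        if op == "delete":
            target.pop(key, None)
        else:
            target[key] = value

    source = {"order-1": "shipped"}
    target = {}
    change_feed = Queue()   # stands in for the stream of changes captured during migration

    bulk_copy(source, target)
    change_feed.put(("update", "order-1", "delivered"))  # happened while the load ran
    change_feed.put(("insert", "order-2", "pending"))

    while True:             # drain captured changes until the target catches up (cutover)
        try:
            apply_change(target, change_feed.get_nowait())
        except Empty:
            break

    print(target)           # {'order-1': 'delivered', 'order-2': 'pending'}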
The sports fan merchandiser Fanatics realized these benefits by using CDC to replicate 100 TB of data from its transactional, ecommerce, and back-office systems, running on SQL Server and Oracle on-premises, to a new analytics platform on Amazon S3. This cloud analytics platform, which included an Amazon EMR (Hadoop)–based data lake, Spark, and Redshift, replaced the company’s on-premises data warehouse.
By pairing the initial load with CDC for subsequent updates, Fanatics maintained uptime and minimized WAN bandwidth expenses. Its analysts are gaining actionable, real-time insights into customer behavior and purchasing patterns.