Streaming Change Data Capture


Introduction: The Rise of Modern Data Architectures

Data is creating massive waves of change and giving rise to a new data-driven economy that is only beginning. Organizations in all industries are changing their business models to monetize data, understanding that doing so is critical to competition and even survival. There is tremendous opportunity as applications, instrumented devices, and web traffic are throwing off reams of 1s and 0s, rich in analytics potential.

These analytics initiatives can reshape sales, operations, and strategy on many fronts. Real-time processing of customer data can create new revenue opportunities. Tracking devices with Internet of Things (IoT) sensors can improve operational efficiency, reduce risk, and yield new analytics insights. New artificial intelligence (AI) approaches such as machine learning can accelerate and improve the accuracy of business predictions. Such is the promise of modern analytics.

However, these opportunities change how data needs to be moved, stored, processed, and analyzed, and it’s easy to underestimate the resulting organizational and technical challenges. From a technology perspective, to achieve the promise of analytics, underlying data architectures need to efficiently process high volumes of fast-moving data from many sources. They also need to accommodate evolving business needs and multiplying data sources. To adapt, IT organizations are embracing data lake, streaming, and cloud architectures. These platforms are complementing and even replacing the enterprise data warehouse (EDW), the traditional structured system of record for analytics. Figure I-1 summarizes these shifts.

Enterprise architects and other data managers know firsthand that we are in the early phases of this transition, and it is tricky stuff. A primary challenge is data integration—the second most likely barrier to Hadoop Data Lake implementations, right behind data governance, according to a recent TDWI survey (source: “Data Lakes: Purposes, Practices, Patterns and Platforms,” TDWI, 2017). IT organizations must copy data to analytics platforms, often continuously, without disrupting production applications (a trait known as zero-impact). Data integration processes must be scalable, efficient, and able to absorb high data volumes from many sources without a prohibitive increase in labor or complexity. Table I-1 summarizes the key data integration requirements of modern analytics initiatives.

All this entails careful planning and new technologies, because traditional batch-oriented data integration tools do not meet these requirements. Batch replication jobs and manual extract, transform, and load (ETL) scripting procedures are slow and inefficient: they disrupt production, tie up talented ETL programmers, and create network and processing bottlenecks. They cannot scale sufficiently to support strategic enterprise initiatives. Batch is unsustainable in today’s enterprise.

Enter Change Data Capture

Change data capture (CDC) software enables continuous incremental replication by identifying and copying data updates as they take place. This makes it a foundational technology for modernizing your data environment. When designed and implemented effectively, CDC can meet today’s scalability, efficiency, real-time, and zero-impact requirements. Without CDC, organizations usually fail to meet modern analytics requirements. They must stop or slow production activities for batch runs, hurting efficiency and decreasing business opportunities. They cannot integrate enough data, fast enough, to meet analytics objectives. They lose business opportunities, lose customers, and break operational budgets.

Why Use Change Data Capture?

Change data capture (CDC) continuously identifies and captures incremental changes to data and data structures (aka schemas) from a source such as a production database. CDC arose two decades ago to help replication software deliver real-time transactions to data warehouses, where the data is then transformed and delivered to analytics applications. Thus, CDC enables efficient, low-latency data transfer to operational and analytics users with low production impact.

Let’s walk through the business motivations for a common use of replication: offloading analytics queries from production applications and servers. At the most basic level, organizations need to do two things with data:

  • Record what’s happening to the business—sales, expenditures, hiring, and so on.

  • Analyze what’s happening to assist decisions—which customers to target, which costs to cut, and so forth—by querying records.

The same database typically cannot support both of these requirements for transaction-intensive enterprise applications, because the underlying server has only so much CPU processing power available. It is not acceptable for an analytics query to slow down production workloads such as the processing of online sales transactions. Hence the need to analyze copies of production records on a different platform. The business case for offloading queries is to both record business data and analyze it, without one activity interfering with the other.

The first method used for replicating production records (i.e., rows in a database table) to an analytics platform is batch loading, also known as bulk or full loading. This process creates files or tables at the target, defines their “metadata” structures based on the source, and then populates them with the data copied from the source.
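As a rough illustration of this full-load pattern (a minimal sketch, not any particular product’s implementation), the following Python snippet recreates a source table’s structure on a target and then copies every row; the in-memory SQLite databases and the orders table are hypothetical stand-ins for real source and target systems.

    import sqlite3

    # Hypothetical in-memory stand-ins for a production source and an analytics target.
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 19.99), (2, 5.00)])

    def full_load(table):
        """Bulk-load one table: recreate it on the target, then copy every row."""
        # Derive the target structure ("metadata") from the source's own definition.
        ddl = source.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?", (table,)
        ).fetchone()[0]
        target.execute(f"DROP TABLE IF EXISTS {table}")
        target.execute(ddl)

        # Copy the full contents of the source table -- every row, on every reload.
        rows = source.execute(f"SELECT * FROM {table}").fetchall()
        if rows:
            marks = ",".join("?" * len(rows[0]))
            target.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
        target.commit()

    full_load("orders")
    print(target.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # prints 2

Note that every run moves the entire table, regardless of how little has changed since the last load.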

Batch loads and periodic reloads with the latest data take time and often consume significant processing power on the source system. This means administrators need to run replication loads during “batch windows” of time in which production is paused or will not be heavily affected. Batch windows are increasingly unacceptable in today’s global, 24×7 business environment.

The Role of Metadata in CDC

Metadata is data that describes data. In the context of replication and CDC, primary categories and examples of metadata include the following:

Infrastructure
Servers, sources, targets, processes, and resources

Users
Usernames, roles, and access controls

Logical structures
Schemas, tables, versions, and data profiles

Data instances
Files and batches

Metadata plays a critical role in traditional and modern data architectures. By describing datasets, metadata enables IT organizations to discover, structure, extract, load, transform, analyze, and secure the data itself. Replication processes, whether batch load or CDC, must be able to reliably copy metadata between repositories.
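To make the “logical structures” category concrete, here is a hedged sketch in Python of the kind of schema metadata a replication process might carry alongside the data itself; the field names and the orders table are illustrative and do not reflect any particular tool’s catalog format.

    # Illustrative schema metadata for one replicated table (not a real tool's format).
    # A replication process must copy a description like this to the target, and keep
    # it current when the source schema changes, so the target can interpret the rows
    # it receives.
    orders_metadata = {
        "source": {"server": "prod-db-01", "database": "sales"},    # infrastructure
        "table": "orders",                                           # logical structure
        "schema_version": 3,
        "columns": [
            {"name": "order_id",    "type": "INTEGER",       "nullable": False},
            {"name": "customer_id", "type": "INTEGER",       "nullable": False},
            {"name": "amount",      "type": "DECIMAL(10,2)", "nullable": True},
            {"name": "created_at",  "type": "TIMESTAMP",     "nullable": False},
        ],
        "primary_key": ["order_id"],
    }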

Here are real examples of enterprise struggles with batch loads (in Chapter 4, we examine how organizations are using CDC to eliminate struggles like these and realize new business value):

  • A Fortune 25 telecommunications firm was unable to extract data from SAP ERP and PeopleSoft into its data lake fast enough. Laborious, multitier loading processes created day-long delays that interfered with financial reporting.

  • A Fortune 100 food company ran nightly batch jobs that failed to reconcile orders and production line-items on time, slowing plant schedules and preventing accurate sales reports.

  • One of the world’s largest payment processors was losing margin on every transaction because it was unable to assess customer creditworthiness in house in a timely fashion. Instead, it had to pay an outside agency.

  • A major European insurance company was losing customers due to delays in its retrieval of account information.

These companies eliminated their bottlenecks by replacing batch replication with CDC. They streamlined, accelerated, and increased the scale of their data initiatives while minimizing the impact on production operations.

Advantages of CDC

CDC has three fundamental advantages over batch replication:

  • It enables faster and more accurate decisions based on the most current data; for example, by feeding database transactions to streaming analytics applications.

  • It minimizes disruptions to production workloads.

  • It reduces the cost of transferring data over the wide area network (WAN) by sending only incremental changes.

Together these advantages enable IT organizations to meet the real-time, efficiency, scalability, and low-production impact requirements of a modern data architecture. Let’s explore each of these in turn.


Faster and More Accurate Decisions

The most salient advantage of CDC is its ability to support real-time analytics and thereby capitalize on data value that is perishable. It’s not difficult to envision ways in which real-time data updates, sometimes referred to as fast data, can improve the bottom line.

For example, business events create data with perishable business value. When someone buys something in a store, there is a limited time to notify their smartphone of a great deal on a related product in that store. When a customer logs into a vendor’s website, this creates a short-lived opportunity to cross-sell to them, upsell to them, or measure their satisfaction. These events often merit quick analysis and action.

In a 2017 study titled The Half Life of Data, Nucleus Research analyzed more than 50 analytics case studies and plotted the value of data over time for three types of decisions: tactical, operational, and strategic. Although mileage varied by example, the aggregate findings are striking:

  • Data used for tactical decisions, defined as decisions that prioritize daily tasks and activities, on average lost more than half its value 30 minutes after its creation. Value here is measured by the portion of decisions enabled, meaning that data more than 30 minutes old contributed to 70% fewer operational decisions than fresher data. Marketing, sales, and operations personnel make these types of decisions using custom dashboards or embedded analytics capabilities within customer relationship management (CRM) and/or supply-chain management (SCM) applications.

  • Operational data on average lost about half its value after eight hours. Examples of operational decisions, usually made over a few weeks, include improvements to customer service, inventory stocking, and overall organizational efficiency, based on data visualization applications or Microsoft Excel.

  • Data used for strategic decisions has the longest-range implications, but still loses half its value roughly 56 hours after creation (a little less than two and a half days). In the strategic category, data scientists and other specialized analysts often are assessing new market opportunities and significant potential changes to the business, using a variety of advanced statistical tools and methods.

Figure 1-1 plots Nucleus Research’s findings. The Y axis shows the value of data to decision making, and the X axis shows the hours after its creation.

Examples bring research findings like this to life. Consider the case of a leading European payments processor, which we’ll call U Pay. It handles millions of mobile, online, and in-store transactions daily for hundreds of thousands of merchants in more than 100 countries. Part of U Pay’s value to merchants is that it credit-checks each transaction as it happens. But loading data in batch to the underlying data lake with Sqoop, an open source ingestion scripting tool for Hadoop, created damaging bottlenecks. The company could not integrate the transactions from its production SQL Server and Oracle systems and the credit agency communications fast enough to meet merchant demands.

U Pay decided to replace Sqoop with CDC, and everything changed. The company was able to transact its business much more rapidly and bring the credit checks in house. U Pay created a new automated decision engine that assesses the risk on every transaction on a near-real-time basis by analyzing its own extensive customer information. By eliminating the third-party agency, U Pay increased margins and improved service-level agreements (SLAs) for merchants.

Indeed, CDC is fueling more and more software-driven decisions. Machine learning algorithms, an example of artificial intelligence (AI), teach themselves as they process continuously changing data. Machine learning practitioners need to test and score multiple, evolving models against one another to generate the best results, which often requires frequent sampling and adjustment of the underlying datasets. This can be part of larger cognitive systems that also apply deep learning, natural-language processing (NLP), and other advanced capabilities to understand text, audio, video, and other alternative data formats.

Minimizing Disruptions to Production

By sending incremental source updates to analytics targets, CDC can keep targets continuously current without batch loads that disrupt production operations. This is critical because it makes replication feasible for a wider variety of use cases. Your analytics team might be willing to wait for the next nightly batch load to run its queries (although that’s increasingly less common). But even then, companies cannot stop their 24×7 production databases for a batch job. Kill the batch window with CDC and you keep production running full-time. You also can scale high-volume data transfers to analytics targets more easily and efficiently.
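As a minimal conceptual sketch of that incremental approach (assuming a hypothetical list of change events rather than any specific log-reading mechanism), the Python snippet below applies inserts, updates, and deletes one at a time to an in-memory copy of a target table, which is all that is needed to keep the copy current without a full reload.

    # Minimal sketch: keep a target copy current by applying change events one by one.
    # The event format here is hypothetical; real CDC tools capture such changes from
    # the source database's transaction log rather than from a Python list.
    target_orders = {1: {"amount": 19.99}, 2: {"amount": 5.00}}  # keyed by primary key

    change_events = [
        {"op": "insert", "key": 3, "row": {"amount": 42.50}},
        {"op": "update", "key": 2, "row": {"amount": 7.25}},
        {"op": "delete", "key": 1},
    ]

    def apply_change(table, event):
        """Apply a single incremental change instead of reloading the whole table."""
        if event["op"] in ("insert", "update"):
            table[event["key"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["key"], None)

    for event in change_events:
        apply_change(target_orders, event)

    print(target_orders)  # {2: {'amount': 7.25}, 3: {'amount': 42.5}}

Because only the changed rows cross the wire and touch the target, the source never needs to pause for a bulk extract.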

Reducing WAN Transfer Cost

Cloud data transfers have in many cases become costly and time-consuming bottlenecks for the simple reason that data growth has outpaced the bandwidth and economics of internet transmission lines. Loading and repeatedly reloading data from on-premises systems to the cloud can be prohibitively slow and costly.

“It takes more than two days to move a terabyte of data across a relatively speedy T3 line (20 GB/hour),” according to Wayne Eckerson and Stephen Smith in their Qlik-commissioned report “Seven Considerations When Building a Data Warehouse Environment in the Cloud” (April 2017). They elaborate:

And that assumes no service interruptions, which might require a full or partial restart…Before loading data, administrators need to compare estimated data volumes against network bandwidth to ascertain the time required to transfer data to the cloud. In most cases, it will make sense to use a replication tool with built-in change data capture (CDC) to transfer only deltas to source systems. This reduces the impact on network traffic and minimizes outages or delays.
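To put rough numbers on that point, the quick back-of-the-envelope comparison below uses the 20 GB/hour T3 figure quoted above; the 2% daily change rate is an assumption chosen purely for illustration.

    # Back-of-the-envelope WAN comparison, using the 20 GB/hour T3 figure quoted above.
    # The 2% daily change rate is a hypothetical assumption for illustration only.
    dataset_gb = 1000          # 1 TB of source data
    line_gb_per_hour = 20      # "relatively speedy T3 line"
    daily_change_rate = 0.02   # assumed share of data that changes each day

    full_reload_hours = dataset_gb / line_gb_per_hour                    # 50 hours
    cdc_delta_hours = dataset_gb * daily_change_rate / line_gb_per_hour  # 1 hour

    print(f"Full reload: {full_reload_hours:.0f} hours")
    print(f"Daily CDC deltas: {cdc_delta_hours:.0f} hour(s)")

Under those assumptions, transferring only the daily deltas cuts a roughly two-day reload down to about an hour of WAN traffic.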

In summary, CDC helps modernize data environments by enabling faster and more accurate decisions, minimizing disruptions to production, and reducing cloud migration costs. An increasing number of organizations are turning to CDC, both as a foundation of replication platforms such as Qlik Replicate® and as a feature of broader ETL offerings such as Microsoft SQL Server Integration Services (SSIS). These organizations use CDC to meet the modern data architecture requirements of real-time data transfer, efficiency, scalability, and zero production impact. In Chapter 2, we explore the mechanics of how that happens.
