It is no secret that the most compelling operational analysis is increasingly real-time rather than historical. By studying thousands or millions of transactions as they happen, you can better measure sales performance, customer behavior, and plant efficiency right now. Advanced techniques such as machine learning help make ever-more-accurate operational predictions.
All this requires real-time pipelines that move high volumes of transactional data with minimal latency through each of three stages: acquisition, transformation, and analysis. Picture a futuristic version of the Panama Canal that moves water between locks instantaneously as needed.
The Apache Spark open source in-memory processing engine has rightfully gained a lot of attention for its ability to accelerate the transformation and analysis stages, running on data stores such as S3 and HDFS. For example, the Spark Streaming API can process data within seconds as it arrives from the source or through a Kafka stream.
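To make that concrete, here is a minimal sketch of consuming transactional events from Kafka with Spark Structured Streaming (shown here instead of the older DStream-based Spark Streaming API). The broker address, topic name, and message schema are illustrative assumptions, not details from the cases below.

```python
# Minimal sketch: read transaction events from Kafka and process them within
# seconds of arrival. Requires the spark-sql-kafka-0-10 connector package.
# The broker, topic, and schema below are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Assumed shape of each incoming JSON transaction message.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
       .option("subscribe", "transactions")                  # assumed topic
       .load())

parsed = (raw
          .select(from_json(col("value").cast("string"), schema).alias("tx"))
          .select("tx.*"))

# Emit each micro-batch to the console with latency measured in seconds.
query = (parsed.writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="5 seconds")
         .start())

query.awaitTermination()
```

In production the console sink would be replaced with HDFS, HBase, or a cloud file system, as in the examples below.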
But there is a problem: latency often lurks upstream.
Real-time processing on the analytics target does not generate real-time insights if the source data flowing into Kafka and Spark is hours or days old. This is the logjam that change data capture (CDC) technology can break, especially for transactional data. CDC acquires live database transactions and sends copies into the pipeline at near-zero latency, eliminating slow and bloated batch jobs. It also reduces processing overhead on production systems and the bandwidth required for cloud transfers.
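For readers less familiar with CDC, the sketch below shows the general shape of a change event and how a consumer might apply it. The field names (op, before, after, and so on) are generic assumptions, not any particular vendor's format.

```python
# Illustrative only: a generic CDC change event and a consumer that applies it
# to an in-memory "table". Field names are assumptions, not a vendor format.
from typing import Any, Dict

def apply_change(event: Dict[str, Any], table: Dict[str, Dict[str, Any]]) -> None:
    """Apply one insert/update/delete change event to a key/value table."""
    key = event["key"]
    if event["op"] in ("insert", "update"):
        table[key] = event["after"]      # latest row image wins
    elif event["op"] == "delete":
        table.pop(key, None)

# A change record as it might arrive moments after the source commit.
event = {
    "table": "orders",
    "op": "update",
    "key": "order-1001",
    "before": {"status": "PENDING"},
    "after": {"status": "SHIPPED"},
    "commit_ts": "2018-06-01T12:00:05Z",
}

orders: Dict[str, Dict[str, Any]] = {}
apply_change(event, orders)
print(orders)  # {'order-1001': {'status': 'SHIPPED'}}
```

Because each event carries only the changed rows, the pipeline moves a steady trickle of small messages instead of periodic full-table extracts.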
Together, CDC and Spark can form the backbone of effective real-time data pipelines. Here are two examples of companies that have put this combination to work.
Accelerating the Supply Chain with an On-Premises Data Lake
Decision makers at an international food industry leader, which we’ll call “Suppertime,” needed a current, continuously integrated view of production capacity data, customer orders, and purchase orders to efficiently process, distribute, and sell tens of millions of chickens each week. But Suppertime struggled to bring these large datasets together because they were scattered across acquisition-related silos spanning ten SAP enterprise resource planning (ERP) applications. With nightly batch replication, they were unable to match orders to production line-item data fast enough. The delays slowed plant operational scheduling and prevented sales teams from filing accurate daily reports.
To streamline the process, Suppertime first selected the right data store: in their case, a cost-effective and scalable Hadoop platform. They adopted a new Hortonworks data lake built on Spark and CDC. They now use CDC to efficiently copy SAP record changes every five seconds, decoding that data from complex SAP pool and cluster source tables. CDC publishes these source updates, along with any metadata changes, to a Kafka message queue that buffers the high volume of incoming messages and delivers them on request to HDFS and HBase consumers in the data lake. Suppertime has limited the number of topics (a.k.a. streams) to 10, one per source table, to minimize processing overhead.
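The topic-per-table pattern can be sketched with Spark Structured Streaming landing the Kafka messages into HDFS. The broker address, topic naming convention, and HDFS paths below are assumptions for illustration; Suppertime's actual implementation details are not public.

```python
# Sketch of the topic-per-table pattern: one Kafka topic per source table,
# landed into HDFS partitioned by topic so downstream HDFS/HBase consumers
# can pick up only the tables they need. Names and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cdc-landing-sketch").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
          .option("subscribePattern", "sap\\..*")             # assumed topic naming, e.g. sap.orders
          .load())

landing = stream.select(
    col("topic"),                                   # identifies the source table
    col("value").cast("string").alias("payload"),   # the change record itself
    col("timestamp").alias("kafka_ts"),
)

query = (landing.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/landing/sap")        # assumed path
         .option("checkpointLocation", "hdfs:///chk/sap")    # assumed path
         .partitionBy("topic")
         .trigger(processingTime="5 seconds")                # mirrors the five-second copy cadence
         .start())

query.awaitTermination()
```

Keeping one topic per table, as Suppertime does, keeps consumer logic simple and bounds broker overhead.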
After the data arrives in HDFS and HBase, Spark in-memory processing matches orders to production in real time and maintains referential integrity for the purchase order tables. As a result, Suppertime has accelerated sales and product delivery with accurate real-time operational reporting. By replacing batch loads with CDC, it operates more efficiently and more profitably.
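The matching step itself can be as simple as a Spark join over the landed tables. The sketch below assumes curated orders and production line-item tables keyed by order_id with a plant_id column; those names and paths are illustrative, not taken from the case study.

```python
# Hedged sketch of the order-to-production matching step as a Spark join.
# Paths, column names, and the existence of curated tables are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("order-matching-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///lake/orders")                # assumed curated table
production = spark.read.parquet("hdfs:///lake/production_items")  # assumed curated table

# Inner join pairs every production line item with its order.
matched = orders.join(production, on="order_id", how="inner")

# Anti join surfaces orphan line items, one practical way to keep an eye on
# referential integrity between the two tables.
orphans = production.join(orders, on="order_id", how="left_anti")

matched.groupBy("plant_id").count().show()
print("orphan line items:", orphans.count())
```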
Real-Time Operational Reporting
Another example (adapted from a real case study) is a US private equity and venture capital firm that built a data lake to consolidate and analyze operational metrics from its portfolio companies. This firm, which we’ll call “StartupBackers,” opted to host its data lake in the Microsoft Azure cloud rather than taking on the administrative burden of an on-premises infrastructure.
CDC remotely captures updates and DDL changes from source databases (Oracle, SQL Server, MySQL, and DB2) at four locations in the United States. The CDC solution then sends that data through an encrypted File Channel connection over a wide area network (WAN) to a virtual machine–based replication engine in the Azure cloud. This replication engine publishes the data updates to Kafka and on to the Databricks file system on request, storing the messages in JSON format.
The Spark platform prepares the data in micro-batches for consumption by the HDInsight data lake, a SQL data warehouse, and various other internal and external subscribers. These targets subscribe to topics that are organized by source table. With this CDC-based architecture, StartupBackers now supports real-time analysis efficiently and without affecting production operations.
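One way to deliver the same micro-batches to several subscribers is Structured Streaming's foreachBatch sink, sketched below. The topic name, storage paths, and JDBC connection details are assumptions; they stand in for the HDInsight, warehouse, and other targets described above.

```python
# Sketch of micro-batch fan-out: one streaming read of CDC messages from Kafka,
# with each micro-batch appended to the data lake and forwarded to a warehouse
# staging table over JDBC. All names, paths, and credentials are placeholders.
# The JDBC step also requires the SQL Server JDBC driver on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("fanout-sketch").getOrCreate()

changes = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
           .option("subscribe", "oracle.orders")               # assumed topic per source table
           .load()
           .select(col("value").cast("string").alias("json_payload")))

def deliver(batch_df, batch_id):
    # Land the raw JSON messages in the lake...
    batch_df.write.mode("append").json("dbfs:/mnt/lake/oracle/orders")  # assumed path
    # ...and forward the same micro-batch to a warehouse staging table.
    (batch_df.write.mode("append")
        .format("jdbc")
        .option("url", "jdbc:sqlserver://dw.example.com;database=analytics")  # assumed endpoint
        .option("dbtable", "staging.orders_changes")
        .option("user", "loader")
        .option("password", "change-me")
        .save())

query = (changes.writeStream
         .foreachBatch(deliver)
         .option("checkpointLocation", "dbfs:/mnt/chk/oracle_orders")  # assumed path
         .start())

query.awaitTermination()
```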
Implemented effectively, change data capture technology can serve as a powerful foundation for modern Spark transactional pipelines. That real-time data canal, with efficient locks for immediate transfer, is well within reach.
To learn more, download the Streaming Change Data Capture: A Foundation for Modern Data Architectures e-book.
This article was written by Kevin Petrie, Senior Director of Product Marketing at Attunity and Jordan Martz, Director of Technology Solutions at Attunity. It was originally published as a guest blog article on the Eckerson Group blog.