Data Lake Ingestion

Data lakes have emerged as a critical platform for cost-effectively storing and processing a wide variety of data types. In contrast to the highly structured data warehouse, a data lake stores high volumes of structured, semistructured, and unstructured data in their native formats.

Data lakes, now widely adopted by enterprises, support myriad analytics use cases, including fraud detection, real-time customer offers, market trend and pricing analysis, and social media monitoring. Most of these use cases require real-time data.

Within the Apache open source ecosystem, whose primary components are available in distribution packages from Hortonworks, Cloudera, and MapR, the historical batch processing engine MapReduce is increasingly being complemented or replaced by real-time engines such as Spark and Tez. These engines process data stored in the Hadoop Distributed File System (HDFS) in the data lake.

Batch load data ingestion and CDC replication software can access HDFS, for example through the WebHDFS REST API, which can use the open source Kerberos network authentication protocol to authenticate users. Once connected, data architects and other users can use either batch load or CDC to land the source data in HDFS in the form of change tables. The data can be refined and queried there, using MapReduce batch processing for historical analysis.
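As a minimal sketch of landing a change record in HDFS, the following renders one CDC event as a change-table row and shows how it might be appended over WebHDFS. The column layout (operation code, timestamp, table, key, values), the NameNode endpoint, and the target path are illustrative assumptions, not a prescribed format:

```python
import csv
import io
from datetime import datetime, timezone

def change_table_row(operation, table, key, columns):
    """Render one CDC event as a CSV change-table record.

    operation: 'I' (insert), 'U' (update), or 'D' (delete) -- a common
    change-table convention; the column layout here is illustrative.
    """
    ts = datetime.now(timezone.utc).isoformat()
    buf = io.StringIO()
    csv.writer(buf).writerow([operation, ts, table, key] + list(columns))
    return buf.getvalue()

# Landing the record over WebHDFS (hypothetical endpoint and path;
# requires the `hdfs` client package and a reachable NameNode, with
# Kerberos or other authentication configured on the cluster):
#
#   from hdfs import InsecureClient
#   client = InsecureClient("http://namenode:9870")
#   row = change_table_row("I", "orders", "42", ["widget", "3"])
#   client.write("/datalake/changes/orders__ct", data=row, append=True)
```

Once landed, such change tables can be refined in place by MapReduce jobs or queried directly for historical analysis.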

Change Data Capture (CDC) Integration with Spark and Tez

Change Data Capture (CDC) becomes critical when real-time engines like the Apache open source components Tez or Spark come into play.

For example, the Spark in-memory processing framework can apply machine learning or other highly iterative workloads to data in HDFS.

Alternatively, users can transform and merge data from HDFS into Hive data stores with familiar SQL structures that derive from traditional data warehouses. Here again, rapid, real-time analytics requires the very latest data, which in turn depends on CDC.
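As a rough sketch of that merge step, the following applies a batch of change records to an in-memory copy of a base table keyed by primary key. In practice this would typically be a SQL MERGE into a Hive table; the operation codes ('I'/'U'/'D') are an illustrative change-table convention:

```python
def apply_changes(base, changes):
    """Merge CDC change records into a base table.

    base:    dict mapping primary key -> row dict
    changes: iterable of (op, key, row) tuples, where op is
             'I' (insert), 'U' (update), or 'D' (delete).
    Returns a new merged dict; the inputs are not modified.
    """
    merged = dict(base)
    for op, key, row in changes:
        if op == "D":
            merged.pop(key, None)           # delete the row if present
        else:                               # 'I' or 'U': upsert the columns
            merged[key] = {**merged.get(key, {}), **row}
    return merged

base = {1: {"qty": 5}, 2: {"qty": 7}}
changes = [("U", 1, {"qty": 6}), ("D", 2, {}), ("I", 3, {"qty": 1})]
merged = apply_changes(base, changes)
# merged == {1: {"qty": 6}, 3: {"qty": 1}}
```

The upsert treatment of 'I' and 'U' mirrors common MERGE semantics: a late-arriving insert for an existing key simply updates it rather than failing.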

CDC and Amazon Simple Storage Service (S3)

CDC software also can feed data updates in real time to data lakes based on the Amazon Simple Storage Service (S3) distributed object-based file store. S3 is being widely adopted as an alternative to HDFS.

In many cases, it is more manageable, elastic, available, cost-effective, and durable than HDFS. S3 also integrates with most other components of the Apache Hadoop stack, including MapReduce, Spark, and Tez.
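A minimal sketch of landing change batches in an S3-based data lake follows; the date-partitioned key layout and the bucket name are illustrative assumptions, not a required convention:

```python
from datetime import date

def change_object_key(table, batch_id, day=None):
    """Build a date-partitioned S3 object key for a batch of change records.

    The prefix layout (changes/<table>/dt=YYYY-MM-DD/...) is an illustrative
    convention that query engines can use for partition pruning.
    """
    day = day or date.today()
    return f"changes/{table}/dt={day.isoformat()}/batch-{batch_id:06d}.csv"

# Uploading a batch with boto3 (requires AWS credentials and a real
# bucket; the bucket name here is hypothetical):
#
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_object(Bucket="my-data-lake",
#                 Key=change_object_key("orders", 7),
#                 Body=batch_bytes)
```

Because S3 objects are immutable, CDC feeds to S3 are typically landed as new objects per batch rather than appended in place, which is why a partitioned key scheme like the one above is useful.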


Streaming Change Data Capture