Extract. Historically, data would be extracted in bulk using batch-based database queries. The challenge comes as data in the source tables is continuously updated. Completely refreshing a replica of the source data is not suitable and therefore these updates are not reliably reflected in the target repository.
Change data capture solves for this challenge, extracting data in a real-time or near-real-time manner and providing you a reliable stream of change data.
Transformation. Typically, ETL tools transform data in a staging area before loading. This involves converting a data set’s structure and format to match the target repository, typically a traditional data warehouse. Given the constraints of these warehouses, the entire data set must be transformed before loading, so transforming large data sets can be time intensive.
Today’s datasets are too large and timeliness is too important for this approach. In the more modern ELT pipeline (Extract, Load, Transform), data is loaded immediately and then transformed in the target system, typically a cloud-based data warehouse, data lake, or data lakehouse. ELT operates either on a micro-batch timescale, only loading the data modified since the last successful load, or CDC timescale which continually loads data as it changes at the source.
Load. This phase refers to the process of placing the data into the target system, where it can be analyzed by BI or analytics tools.