Back to Basics
The purpose of any data pipeline is simply to move data, stored in a prescribed format and structure, from one place to another. The data's origin is called the source and the destination is known as the target, sometimes referred to as a sink. Two patterns describe the process, but neither prescribes duration, frequency, transport technology, programming language or tools. The patterns are as follows:
- ETL – Stands for Extract, Transform, Load and describes exactly what happens at each stage of the pipeline. The data is first extracted from the source and then transformed in some manner. Finally, the data subset is loaded into the target system.
- ELT – The Extract, Load, Transform pattern is similar. The pipeline starts by, again, extracting a subset of data from the source, but then immediately loads it into the target. The final step performs the data transformation.
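The two patterns above differ only in the order of the same three steps. A minimal sketch, assuming toy in-memory lists for the source and target and a hypothetical uppercase "transformation" (none of these names come from a real vendor API):

```python
# Hypothetical building blocks; real pipelines would talk to databases,
# queues or files rather than Python lists.

def extract(source):
    """Pull the subset of changed rows from the source system."""
    return [row for row in source if row.get("changed")]

def transform(rows):
    """Reshape rows into the target's format (here: uppercase a field)."""
    return [{**row, "name": row["name"].upper()} for row in rows]

def load(target, rows):
    """Write rows into the target system."""
    target.extend(rows)

def run_etl(source, target):
    load(target, transform(extract(source)))   # transform BEFORE the load

def run_elt(source, target):
    load(target, extract(source))              # load the raw data first...
    target[:] = transform(target)              # ...then transform in the target

source = [{"name": "alice", "changed": True}, {"name": "bob", "changed": False}]
etl_target, elt_target = [], []
run_etl(source, etl_target)
run_elt(source, elt_target)
assert etl_target == elt_target == [{"name": "ALICE", "changed": True}]
```

Both runs end with identical data in the target; what changes is where the transformation work happens, which is exactly the trade-off the rest of this article explores.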
The only difference between the two patterns is when the data transformation is performed. Note, however, that neither pattern dictates whether the transformation occurs before, during or after the data has been transported. For example, let's examine the ETL pattern.
The diagram below illustrates that the subset of data is transported across the wire BEFORE the transformation and final load occurs.
By the same token, it’s also equally valid to extract and transform the data subset BEFORE transmission and eventual load.
In reality, vendor implementations often dictate the order and precedence of the data transport operation. In fact, much of the implementation details mentioned earlier (i.e., frequency, etc.) are also highly vendor dependent.
ETL and Batch
Generally, ETL processes operate on a predetermined schedule, such as every minute, hour, day or week, depending on the use case. Note that ETL pipelines can also run in response to an external trigger or event, but this is less common.
A scheduled ETL process is said to operate in a batch mode, with the frequency often dictated by the following constraints:
- Timeliness of data required.
- Time to extract the data.
- The duration of the transformation.
- Time to transmit the data.
- Time to perform the load.
- Transaction volume between batches.
Overall, this process works well but struggles when the volume of data and ETL processing time exceed the desired timeliness. For example, a bank may need to update its data warehouse with a million transactions every 10 minutes, but it takes 15 minutes to extract, transform and load the batch. Extending the interval to 20 minutes is not the answer either, because in that time the backlog grows to two million rows.
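The bank's arithmetic can be sketched as a simple feasibility check. The numbers come from the example above; the assumption that processing time scales linearly with row count is mine:

```python
# Hedged sketch of the batch-timing problem from the bank example.
# Assumes processing time scales linearly with the number of rows.

def batch_feasible(rows, rows_per_minute, window_minutes):
    """A batch keeps up only if it finishes within its schedule window."""
    processing_minutes = rows / rows_per_minute
    return processing_minutes <= window_minutes

# One million rows take 15 minutes to ETL -> ~66,667 rows per minute.
rate = 1_000_000 / 15

# A 10-minute window cannot absorb one million rows...
print(batch_feasible(1_000_000, rate, 10))   # False (needs 15 minutes)

# ...and widening the window to 20 minutes doesn't help, because the
# backlog doubles to two million rows in that time.
print(batch_feasible(2_000_000, rate, 20))   # False (needs 30 minutes)
```

Under linear scaling, stretching the schedule can never close the gap: doubling the window doubles the backlog, so the batch falls further behind each cycle.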
So what can the bank do?
ELT, CDC, and Real-time
The bank’s alternative strategy is to rethink the process and reprioritize when the different steps occur. If we assume that the time to extract, transport and load the data is the same as before, then using ELT allows the backend to perform the transformation, potentially when more resources are available.
This pattern can be further enhanced by adding Change Data Capture (CDC). CDC doesn’t operate on a batch schedule like ETL, but is triggered every time a change occurs to the data source. Therefore, in our bank example, the ELT process runs for every transaction and a minimal amount of data is transmitted over the wire. There’s no waiting to process a million rows. Effectively, the extract and load processes happen in real-time.
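The per-change behavior described above can be sketched with a toy observer: each write to the source immediately fires the extract-and-load step. The `Source` class and its `on_change` hook are hypothetical; real CDC tools capture changes by reading the database's transaction log rather than application-level callbacks:

```python
# Minimal CDC-style sketch: every change to the source triggers an
# immediate extract-and-load, instead of accumulating into a batch.
# The observer mechanism is hypothetical; real CDC reads the DB log.

class Source:
    def __init__(self):
        self.handlers = []

    def on_change(self, handler):
        """Register a callback to fire on every change event."""
        self.handlers.append(handler)

    def write(self, row):
        # Each individual change flows through the pipeline right away.
        for handler in self.handlers:
            handler(row)

target = []
source = Source()
source.on_change(lambda row: target.append(row))  # extract + load per change

source.write({"txn": 1, "amount": 100})
source.write({"txn": 2, "amount": -40})
assert len(target) == 2  # rows land untransformed; transform is deferred
```

Because only the changed row crosses the wire, each event is tiny, and the transformation can be deferred until the data is consumed, matching the ELT split described above.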
The bank, then, has the option to schedule a bulk transform process or defer the transformation until the data is consumed. Typically, we find customers employ both of those options.
Moreover, the volume, velocity and variety of data have grown so massively in recent years that ELT has replaced ETL in many instances as the de facto pattern for data movement, especially in scenarios such as cloud data migration, data warehouse and lake ingestion, and MLOps, the continuous delivery and automation of machine learning pipelines.
We started this article debating whether ETL or ELT is the better pattern for data pipelines and settled on the unsatisfying answer of “it depends.” Although ETL has traditionally been the staple of data integration, the truth is that timeliness matters and ETL stutters under load. Therefore, if you need data in real-time for your analytics or machine learning projects, ELT is the pattern of choice.
If you want to try your hand at moving data in real-time using an award-winning ELT solution, then take Qlik Data Integration for a test drive, trialing Qlik Replicate here.