The transition from "What happened?" to "Why did it happen?" on "The Analytics Continuum" is the most challenging step to take because the capability to answer these questions requires an entirely new approach to data management. This step requires access to data in real-time as the events happen and a traditional extract, transform, load (ETL) process can't support actionable feedback.
An excellent real-time data processing architecture needs to be fault-tolerant and scalable; it needs to support batch and incremental updates and be extensible.
Here we explore two essential data processing architectures known as Lambda and Kappa that serve as the backbone of various enterprise application.
Lambda architecture comprises Batch Layer, Speed Layer (also known as Stream layer), and Serving Layer.
The batch layer operates on the complete data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time. The batch layer stores the raw data as it arrives and computes the batch views for consumption. Naturally, batch processes will occur at some interval and will be long-lived. The scope of data is anywhere from hours to years.
The speed layer generates results in a low-latency, near real-time fashion. The speed layer is used to compute the real-time views to complement the batch views. The speed layer receives the arriving data and performs incremental updates to the batch layer results. Thanks to the incremental algorithms implemented at the speed layer, the computation cost is significantly reduced.
The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up-to-the-moment access to the latest possible data.
Finally, the serving layer enables various queries of the results sent from the batch and speed layers. The outputs from the batch layer in the form of batch views and the speed layer in the form of near-real-time views are forwarded to the serving layer, which uses this data to cater to the pending queries on an ad-hoc basis.
Architecture Overview:
Implementation Example:
Pros
Cons
Use Cases
The Kappa architecture solves the redundant part of the Lambda architecture. It is designed with the idea of replaying data. Kappa architecture avoids maintaining two different code bases for the batch and speed layers. The key idea is to handle real-time data processing, and continuous data reprocessing using a single stream processing engine and avoid a multi-layered Lambda architecture while meeting the standard quality of service.
Architecture Overview:
Implementation Example:
Pros
Cons
Use Cases
Kappa is not a replacement for Lambda as some use-cases deployed using the Lambda architecture cannot be migrated.
When you seek an architecture that is more reliable in updating the data lake as well as efficient in training the machine learning models to predict upcoming events robustly, then use the Lambda architecture as it reaps the benefits of both the batch layer and speed layer to ensure few errors and speed.
On the other hand, when you want to deploy big data architecture using less expensive hardware and require it to deal effectively with unique events occurring continuously, then select the Kappa architecture for your real-time data processing needs.
Check out this whitepaper and related webinar to deep dive into this topic.