I am writing this blog post during an unusually hot summer, in lockdown. I’ve taken the opportunity to spend my evenings dusting off the CDs (remember them?) for a music revival by genre. During a ska session, the much-covered “A Message To You Rudy”, in the version by UK ska band The Specials, came on. It got me thinking about data streaming messages.
What Is Data Streaming?
Data streaming is designed to deliver real-time insight that can improve your business and give it a competitive edge. Data is processed in real time from potentially thousands of sources, such as sensors, financial trading floor transactions, e-commerce purchases, web and mobile applications, social networks and many more. By aggregating and analyzing this real-time data, you can use database streaming to gain actionable insights that improve business agility, support better-informed decisions, fine-tune operations, improve customer service and help you respond quickly to any opportunity or, indeed, crisis that may come your way.
Effective database streaming requires a sophisticated streaming architecture and a Big Data solution like Apache Kafka. Kafka is a fast, scalable and durable publish-subscribe messaging system that supports data stream processing by simplifying data ingestion. Kafka can process more than 100,000 messages per second and is an ideal tool for enabling database streaming to support Big Data analytics and data lake initiatives.
What Is Stream Processing?
Stream processing is a method of performing transformations or generating analytics on transactional data inside a stream. Traditional ETL-based data integration functions are performed on data that will be analyzed in a database, data lake or data warehouse, and analytics are typically run within the data warehouse on structured and cleansed data. In contrast, streaming platforms like Apache Kafka enable both integration and in-stream analytics on data as it moves through the stream. Typically, data is ingested into a stream via change data capture (CDC) technologies.
Stream Processing Use Cases
- Real-time analytics – Generate analysis within live streams of data, often computed over defined time windows.
- Microservices integration – Process inputs from other services.
- Log analysis – Filter records from log data sources to look for anomalies or other critical variations.
- Data integration – Support basic to more complex data integration scenarios.
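To make the first use case more concrete, here is a minimal sketch of a time-windowed aggregation in plain Python. It is not Kafka code: the `tumbling_window_totals` function and the `(timestamp, amount)` event shape are assumptions made purely for illustration, and a real deployment would run this kind of logic inside a stream processing engine.

```python
from collections import defaultdict


def tumbling_window_totals(events, window_seconds):
    """Sum event amounts per fixed, non-overlapping time window.

    events: iterable of (timestamp_seconds, amount) pairs (hypothetical shape).
    Returns a dict mapping each window's start time to the total amount.
    """
    totals = defaultdict(float)
    for timestamp, amount in events:
        # Align the timestamp down to the start of the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        totals[window_start] += amount
    return dict(totals)


# Hypothetical purchase events: (seconds since start, purchase amount)
purchases = [(3, 20.0), (42, 5.5), (61, 10.0), (119, 4.5), (120, 1.0)]
print(tumbling_window_totals(purchases, window_seconds=60))
# {0: 25.5, 60: 14.5, 120: 1.0}
```

Each event lands in exactly one 60-second window, so the totals can be emitted as soon as a window closes rather than waiting for a nightly batch.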
What Is CDC?
Change data capture (CDC) identifies the changes made to a source database (inserts, updates and deletes) so that they can be delivered downstream. It is an optimal mechanism for capturing and delivering transactional data from sources into a streaming platform. Stream processing can then take this CDC-generated data and create new streams for additional use cases, or it can generate analysis within the stream of transactions.
The basic concept is that database values are converted to a changelog, and stream processing operations then either reconstruct those values or derive new ones from them. This is also, in essence, how CDC replicates source data to a target.
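The changelog idea can be sketched in a few lines of Python. The event format below (`op`, `key` and `row` fields) is a hypothetical one chosen for illustration; it is not the message format of Qlik Replicate, Confluent or any other product.

```python
def apply_changelog(changelog):
    """Rebuild table state by replaying change events in order."""
    table = {}
    for event in changelog:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # The event carries the full new row image for this key.
            table[key] = event["row"]
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


# Hypothetical changelog captured from a source table
changelog = [
    {"op": "insert", "key": 1, "row": {"name": "Rudy", "balance": 100}},
    {"op": "insert", "key": 2, "row": {"name": "Terry", "balance": 50}},
    {"op": "update", "key": 1, "row": {"name": "Rudy", "balance": 80}},
    {"op": "delete", "key": 2},
]
target = apply_changelog(changelog)
print(target)  # {1: {'name': 'Rudy', 'balance': 80}}
```

Replaying the same ordered changelog always produces the same table, which is why a target can be kept in step with a source simply by applying the changes as they arrive.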
Why Use Log-based CDC?
Log-based CDC reads changes from the database’s transaction log rather than querying the source tables, so it has low to near-zero impact on production sources while new streams are created and in-stream analytics are performed in near real time rather than in batches. Once the changes are in a stream, you can avoid processing duplicate messages, process messages individually or in aggregate, and perform time-based aggregation of messages.
However, using Kafka for database streaming can create a variety of challenges. Source systems may be adversely impacted. Highly skilled data engineers may need to do a significant amount of custom development. And scaling efficiently to support many and varied data sources can be difficult.
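As one illustration of avoiding duplicates, the sketch below assumes that every message carries a strictly increasing position taken from the source log. The `position` field name and the `deduplicate` helper are hypothetical; real systems expose log positions in their own formats.

```python
def deduplicate(messages):
    """Yield each change once, skipping redelivered messages.

    Assumes every message carries a strictly increasing `position`
    taken from the source log (a hypothetical field name).
    """
    last_applied = -1
    for message in messages:
        if message["position"] <= last_applied:
            continue  # already processed, so this is a redelivery
        last_applied = message["position"]
        yield message


# Hypothetical delivery sequence in which one message arrives twice
delivered = [
    {"position": 0, "change": "insert order 1"},
    {"position": 1, "change": "update order 1"},
    {"position": 1, "change": "update order 1"},  # redelivered after a retry
    {"position": 2, "change": "delete order 1"},
]
unique = list(deduplicate(delivered))
print([m["position"] for m in unique])  # [0, 1, 2]
```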
To help ease and overcome these challenges, look to Qlik. We have a strong partnership with Confluent using Qlik Replicate, which leverages CDC. Below are six reasons to choose Confluent and Qlik for your real-time Apache Kafka data streaming.
- Freedom of deployment choice – Deploy and stream on any cloud.
- Development and integration flexibility – Integrate easily with many existing systems as part of a broad and expanding partner ecosystem, with a wide set of heterogeneous sources and pre-built connectors in Confluent.
- Comprehensive management, monitoring and control – Reveal insights about the inner workings of your Kafka clusters and the data flowing through them. Gain key operational and monitoring capabilities and meet service-level agreements with confidence.
- Save time and effort through automation – Data engineers can greatly reduce their administrative burden and project time by automatically configuring data type conversions with a drag-and-drop interface instead of mapping and scripting them individually for each source type. Use Confluent to automate the process for configuring databases to publish to Kafka.
- Minimal production impact – Qlik’s agentless, log-based CDC approach is faster and has a much lower impact on production sources and workloads than alternatives such as query-based and batch processing, while creating new streams and performing in-stream analytics in near real time.
- Fan-out capabilities – Data engineers can automatically provision a single database producer. Kafka can then map to many streams and consumers, eliminating duplicative configuration processes and minimizing the impact on production workloads.
If you want to learn more, please visit http://www.qlik.com/confluent. I also highly recommend the eBook “Apache Kafka Transaction Data Streaming for Dummies,” jointly written with Confluent. Reading it will help you understand why Confluent and Qlik are ideal solutions for your Apache Kafka data streaming needs.