What it is, how it works, examples, and best practices. This guide provides practical advice to help you understand and manage streaming data.
Streaming data is data that flows continuously from source systems to a target. It is typically generated simultaneously, at high speed, by many data sources, which can include applications, IoT sensors, log files, and servers.
Streaming data architecture lets you consume, store, enrich, and analyze this flowing data in real time, as it is generated. Real-time analytics gives you deeper insight into your business and customer activity and lets you react quickly to changing conditions. These in-the-moment insights can help you respond to market events and customer issues faster than your competitors.
The most common use cases for data streaming are streaming media, stock trading, and real-time analytics. However, data stream processing is broadly applied in nearly every industry today. This is due to the continuing rise of big data, the Internet of Things (IoT), real-time applications, and customer expectations for personalized recommendations and immediate response.
Streaming data is critical for any application that depends on in-the-moment information to support the following use cases:
Other examples of applying real-time data streaming include:
Let’s start with an analogy to frame the concept before we dive into the details. Streaming data is like a radio station constantly broadcasting on a particular frequency. (Frequencies are like data topics: you don’t consume them until you tune your processor to them.) When you tune your radio to a given frequency, it picks up the signal and processes it into audio you can understand. Your radio must be fast enough to keep up with the broadcast, and if you want a copy of the music, you have to record it, because once it’s broadcast, it’s gone.
Two primary layers are needed to process streaming data when using streaming systems like Apache Kafka, Confluent, Google Pub/Sub, Amazon Kinesis, and Azure Event Hubs:
A broader cloud architecture is needed to realize the full potential of streaming data. Stream processing systems like Apache Kafka can consume, store, enrich, and analyze data in motion, and a number of cloud providers let you build an “off-the-shelf” data stream. However, these options may not meet your requirements, or you may face challenges integrating your legacy databases and systems. The good news is that a robust ecosystem of tools, some of them open source, lets you build your own “bespoke” data stream.
How to build your own data stream. Here we walk through how streaming data works and the data streaming technologies behind each of the four key steps to building your own data stream.
1. Aggregate all your data sources, which may be relational databases or transactional systems located on-premises or in the cloud, using a change data capture (CDC) streaming tool. You will then connect these sources to a stream processor, as the sketch below illustrates.
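For illustration, here is a minimal sketch of this step using Debezium, an open-source CDC tool that runs on Kafka Connect. Every host name, credential, and table name below is a placeholder assumption, not a prescription:

```python
import requests

# Hypothetical Debezium connector for a Postgres source; the host,
# credentials, and table are placeholders for your own environment.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "sales",
        "table.include.list": "public.orders",
        # Debezium 2.x writes change events to topics named
        # <topic.prefix>.<schema>.<table>, here "sales.public.orders".
        "topic.prefix": "sales",
    },
}

# Register the connector through the Kafka Connect REST API (default port
# 8083); Debezium then streams every insert/update/delete as a Kafka event.
response = requests.post("http://localhost:8083/connectors", json=connector)
response.raise_for_status()
```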
2. Build a stream processor using a tool such as Apache Kafka or Amazon Kinesis. The data will typically be processed sequentially and incrementally on a record-by-record basis, but it can also be processed over sliding time windows.
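The sketch below illustrates the record-by-record pattern with a simple tumbling time window (a fixed, non-overlapping variant of the sliding window), using the confluent-kafka Python client. The broker address and topic name are assumptions carried over from the previous sketch:

```python
import time
from collections import defaultdict
from confluent_kafka import Consumer

WINDOW_SECONDS = 10  # tumbling-window size; an assumption for the demo

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "window-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sales.public.orders"])  # topic from the CDC sketch above

window_start = time.time()
counts = defaultdict(int)

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # record-by-record consumption
        if msg is not None and msg.error() is None:
            key = (msg.key() or b"unknown").decode()
            counts[key] += 1  # incremental, per-record update

        # When the window closes, emit the aggregate and reset the state.
        if time.time() - window_start >= WINDOW_SECONDS:
            print(f"events per key, last {WINDOW_SECONDS}s: {dict(counts)}")
            counts.clear()
            window_start = time.time()
finally:
    consumer.close()
```

Frameworks such as Kafka Streams and Apache Flink manage this windowing state for you, with fault tolerance that a hand-rolled loop like this one lacks.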
Your stream processor should:
3. Query or store the streaming data. Leading tools to do this include Google BigQuery, Snowflake, Amazon Kinesis Data Analytics, and Dataflow. These tools can perform a broad range of analytics such as filtering, aggregating, correlating, and sampling.
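As a minimal sketch of this step with Google BigQuery’s Python client: the project, dataset, table, and schema below are hypothetical, and the rolling-window query assumes a table partitioned by ingestion time:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default Google Cloud credentials
table_id = "my-project.streaming.orders"  # hypothetical project.dataset.table

# Store: land each record (or micro-batch) in the warehouse as it arrives.
rows = [{"order_id": 1, "amount": 42.50}, {"order_id": 2, "amount": 19.99}]
errors = client.insert_rows_json(table_id, rows)
assert not errors, errors

# Query: a rolling aggregate over the last ten minutes of ingested data
# (relies on the table being partitioned by ingestion time).
sql = f"""
    SELECT COUNT(*) AS orders, SUM(amount) AS revenue
    FROM `{table_id}`
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
"""
for row in client.query(sql).result():
    print(f"orders={row.orders}, revenue={row.revenue}")
```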
There are two approaches to do this:
4. Output for analysis, alerts, real-time applications, data science, and machine learning or AutoML. Once the streaming data has passed through the query or store phase, it can be output for multiple use cases:
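As one small illustration of this output phase, the sketch below routes a windowed aggregate to a dashboard feed and raises an alert when revenue falls below a threshold; the webhook URL and threshold are invented for the example:

```python
import requests

ALERT_WEBHOOK = "https://hooks.example.com/alerts"  # invented endpoint
REVENUE_FLOOR = 1000.0  # invented business threshold

def handle_window(orders: int, revenue: float) -> None:
    """Route one windowed aggregate to its downstream consumers."""
    # Real-time application path: push the fresh metric to a dashboard feed.
    print(f"dashboard update: orders={orders}, revenue={revenue:.2f}")

    # Alerting path: notify a channel when revenue drops below the floor.
    if revenue < REVENUE_FLOOR:
        requests.post(ALERT_WEBHOOK, json={
            "text": f"Revenue {revenue:.2f} fell below {REVENUE_FLOOR:.2f}",
        })

# Example: hand it the kind of aggregate the earlier sketches produce.
handle_window(orders=87, revenue=812.40)
```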
Traditional batch processing cannot keep up with today's complex, fast-moving data environment. Still, many organizations use both a real-time layer and a batch layer to cover the spectrum of their data processing needs.
Let’s take a side-by-side look at batch vs real-time data processing, and how they can work in tandem to provide a holistic data processing solution for your business.
| | Batch Processing | Real-Time Stream Processing |
| --- | --- | --- |
| Data Type | Static, historical data. | Dynamic, time-sensitive data. |
| Data Ingestion | Loaded as batches of large data sets. | Ingests a continual sequence of individual records (or micro-batches). |
| Query Scope | Queries the entire dataset. | Queries only the most recent data record or a rolling time window. |
| Processing | Processes the entire dataset. | Processes only the most recent data record or a rolling time window. |
| Latency | Can be minutes to hours. | Typically milliseconds. |
| Data Analysis | Deep analysis using sophisticated analytics. | Response functions, rolling calculations, and aggregates. (Active Intelligence brings more capabilities.) |
The challenges associated with data streaming arise from the nature of streaming data itself. As noted above, it flows continuously in real time, at high velocity and high volume. It’s also often volatile, heterogeneous, and incomplete. This results in the following challenges:
Modern data integration that delivers real-time, analytics-ready, actionable data to any analytics environment, from Qlik to Tableau to Power BI and beyond.