Let’s start with an analogy to help frame the concept before we dive into the details. Streaming data is like a radio station constantly broadcasting on a particular frequency. (The frequency is like a data topic: you don’t consume it until you tune your processor to it.) When you tune your radio to that frequency, it picks up the signal and processes it into audio you can understand. Your radio has to be fast enough to keep up with the broadcast, and if you want a copy of the music you have to record it, because once it’s broadcast, it’s gone.
Two primary layers are needed to process streaming data with systems like Apache Kafka, Confluent, Google Cloud Pub/Sub, Amazon Kinesis, and Azure Event Hubs (a short code sketch after the list illustrates both):
- Storage. This layer should enable low-cost, fast, replayable reads and writes of large data streams while supporting strong consistency and record ordering.
- Processing. This layer consumes data from the storage layer and runs computations on it. It also directs the storage layer to remove data that is no longer needed.
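To make the two layers concrete, here is a minimal sketch using the confluent-kafka Python client. The broker address (localhost:9092) and the topic name ("events") are illustrative assumptions, not details from this article: the producer appends records to the storage layer, and the consumer acts as a bare-bones processing layer that replays the stream from the beginning.

```python
from confluent_kafka import Producer, Consumer

# Storage layer: append records to a topic. Kafka persists them durably and
# preserves ordering within a partition, so records with the same key keep
# their relative order.
producer = Producer({"bootstrap.servers": "localhost:9092"})
for i in range(5):
    producer.produce("events", key=str(i), value=f"reading-{i}")
producer.flush()  # block until all buffered records have been written

# Processing layer: consume the stream and run a computation on each record.
# Because the storage layer retains records, a new consumer group can replay
# the stream from the start ("auto.offset.reset": "earliest").
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-processor",       # hypothetical consumer group name
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue  # no record available yet; keep polling
        if msg.error():
            continue  # skip errored messages in this sketch
        # Stand-in for real processing: enrich, aggregate, route, etc.
        print(f"processed {msg.key().decode()}: {msg.value().decode()}")
finally:
    consumer.close()
```

Unlike the radio in the analogy, the storage layer does keep a recording: the consumer can fall behind or restart and still read every record, up to the topic’s retention limit.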
A broader cloud architecture is needed to exploit streaming data to its fullest potential. Stream processing systems like Apache Kafka can consume, store, enrich, and analyze data in motion, and a number of cloud service providers let you build an “off-the-shelf” data stream. However, these options may not meet your requirements, or you may face challenges integrating your legacy databases or systems. The good news is that there is a robust ecosystem of tools, some of them open source, that you can leverage to build your own “bespoke” data stream.
How to build your own data stream. Below, we explain how streaming data works and cover the technologies behind each of the four key steps to building your own data stream.