Running on a horizontally scalable cluster of commodity servers, Apache Kafka ingests real-time data from multiple "producer" systems and applications—such as logging systems, monitoring systems, sensors, and IoT applications—and at very low latency makes the data available to multiple "consumer" systems and applications such as real-time analytics.
Consumers can range from analytics platforms to applications that rely on real-time data processing. Examples include logistics or location-based micromarketing applications.
Here we define the terms shown above:
Producer: Client application that push events into topics
Cluster: One or more servers (called brokers) running Apache Kafka.
Topic: The method to categorize and durably store events. There are two types of topics: compacted and regular. Records in compacted topics do not expire based on time or space bounds. Newer topic messages update older messages that possess the same key and Apache Kafka does not delete the latest message unless deleted by the user. For regular topics, records can be configured to expire, deleting old data to free storage space.
Partition: The mechanism to distribute data across multiple storage servers (brokers). Messages are indexed and stored together with a timestamp and ordered by the position of the message within a partition. Partitions are distributed across a node cluster and are replicated to multiple servers to ensure that Apache Kafka delivers message streams in a fault-tolerant manner.
Consumers: Client applications which read and process the events from partitions. The Apache Kafka Streams API allows writing Java applications which pull data from Topics and write results back to Apache Kafka. External stream processing systems such as Apache Spark, Apache Apex, Apache Flink, Apache NiFi and Apache Storm can also be applied to these message streams.