Apache Kafka
{{Short description|Software bus for high-volume data feeds}}
{{Multiple issues|
{{how-to|date=November 2023}}
{{Third-party|date=November 2023}}
}}
{{Infobox software
| name = Apache Kafka
| logo = File:Apache Kafka logo.svg
| author = LinkedIn
| developer = Apache Software Foundation
| released = {{Start date and age|2011|01}}{{cite web|title=Open-sourcing Kafka, LinkedIn's distributed message queue|url=https://blog.linkedin.com/2011/01/11/open-source-linkedin-kafka|access-date=27 October 2016|archive-date=26 December 2022|archive-url=https://web.archive.org/web/20221226020822/https://blog.linkedin.com/2011/01/11/open-source-linkedin-kafka/|url-status=live}}
| discontinued = No
| latest release version = {{wikidata|property|preferred|references|edit|Q16235208|P348|P548=Q2804309}}
| latest release date = {{wikidata|qualifier|preferred|single|Q16235208|P348|P548=Q2804309|P577}}
| programming language = Scala, Java
| operating system = Cross-platform
| genre = Stream processing, Message broker
| license = Apache License 2.0
| website = {{Official URL}}
}}
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system, developed by the Apache Software Foundation, written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides the Kafka Streams libraries for stream-processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."{{cite web|url=https://kafka.apache.org/documentation/#maximizingefficiency |title=Efficiency |website=kafka.apache.org|access-date=2019-09-19}}
==History==
Kafka was originally developed at LinkedIn and subsequently open-sourced in early 2011. Jay Kreps, Neha Narkhede and Jun Rao helped co-create Kafka.{{cite web|last=Li|first=Steven|date=2020|title=He Left His High-Paying Job At LinkedIn And Then Built A $4.5 Billion Business In A Niche You've Never Heard Of|website=Forbes|url=https://www.forbes.com/sites/stevenli1/2020/05/11/confluent-jay-kreps-kafka-4-billion-2020/?sh=1a82e619709d|access-date=8 June 2021|archive-url=https://web.archive.org/web/20230131210616/https://www.forbes.com/sites/stevenli1/2020/05/11/confluent-jay-kreps-kafka-4-billion-2020/?sh=1a82e619709d|archive-date=2023-01-31|url-status=live}} Graduation from the Apache Incubator occurred on 23 October 2012.{{Cite web|title=Apache Incubator: Kafka Incubation Status|url=https://incubator.apache.org/projects/kafka.html|access-date=2022-10-17|archive-date=2022-10-17|archive-url=https://web.archive.org/web/20221017081525/https://incubator.apache.org/projects/kafka.html|url-status=live}} Jay Kreps chose to name the software after the author Franz Kafka because it is "a system optimized for writing", and he liked Kafka's work.{{cite book |last1=Narkhede |first1=Neha |last2=Shapira |first2=Gwen |last3=Palino |first3=Todd |title=Kafka: The Definitive Guide |date=2017 |publisher=O'Reilly |isbn=978-1-4919-3611-5 |url=https://books.google.com/books?id=dXwzDwAAQBAJ |chapter=Chapter 1 |quote=People often ask how Kafka got its name and if it has anything to do with the application itself. Jay Kreps offered the following insight: "I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka."}}
==Operation==
Apache Kafka is a distributed, log-based messaging system that guarantees ordering within individual partitions rather than across an entire topic. Unlike queue-based systems, Kafka retains messages in a durable, append-only log, allowing multiple consumers to read at different offsets. Consumers can manage offsets manually, which gives them control over retries and failure handling: if a consumer fails to process a message, it can delay committing the offset, pausing progress in that partition while other partitions remain unaffected. This partition-based design enables fault isolation and parallel processing, while ordering can still be maintained within each partition depending on how consumers handle records.{{Cite book |last=Narkhede |first=Neha |title=Kafka: the definitive guide: real-time data and stream processing at scale |last2=Shapira |first2=Gwen |last3=Palino |first3=Todd |date=2017 |publisher=O'Reilly Media |isbn=978-1-4919-3616-0 |location=Sebastopol, CA |oclc=933521388}}{{page needed|date=May 2025}}
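The log-and-offset model can be illustrated with a minimal in-memory sketch in Java (a toy model for illustration only, not the actual Kafka client API; all class and method names here are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a topic: an append-only log per partition.
// Records are never removed on read; each consumer tracks its own offset.
class ToyTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    ToyTopic(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) partitions.add(new ArrayList<>());
    }

    // Records with the same key land in the same partition,
    // which is what preserves their relative order.
    void append(String key, String value) {
        int p = Math.abs(key.hashCode()) % partitions.size();
        partitions.get(p).add(value);
    }

    // A consumer reads from an explicit offset; advancing (committing)
    // that offset is the consumer's own decision.
    List<String> read(int partition, int fromOffset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList(Math.min(fromOffset, log.size()), log.size()));
    }
}

public class ToyTopicDemo {
    public static void main(String[] args) {
        ToyTopic topic = new ToyTopic(2);
        topic.append("user-1", "login");
        topic.append("user-1", "click");
        topic.append("user-1", "logout");

        int p = Math.abs("user-1".hashCode()) % 2;
        // Two independent consumers at different offsets see the same durable log.
        System.out.println(topic.read(p, 0)); // full history, in append order
        System.out.println(topic.read(p, 2)); // only records at offset 2 and later
    }
}
```

Because the log is retained rather than consumed destructively, a consumer that delays committing its offset simply re-reads from the old position, as the sketch shows.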
In 2025, Apache Kafka introduced "Queues for Kafka",[https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka KIP-932: Queues for Kafka] adding share groups as an alternative to consumer groups. This feature enables queue-like semantics in which consumers cooperatively process records from the same partitions, with individual message acknowledgment and delivery tracking. Unlike traditional consumer groups, where partitions are exclusively assigned, share groups allow the number of consumers to exceed the partition count, making them suitable for work-queue patterns while retaining Kafka's durability and scalability. This development addresses the common challenge of "over-partitioning" that many Kafka users face.{{cn|date=May 2025}}
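The contrast between the two consumption models can be sketched as a toy simulation (plain Java; this is not the share-group client API, and all names are invented for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy comparison of consumer-group vs share-group assignment semantics.
public class ShareGroupSketch {
    // Consumer group: each partition is owned by exactly one consumer,
    // so consumers beyond the partition count sit idle.
    static Map<Integer, Integer> consumerGroupAssignment(int partitions, int consumers) {
        Map<Integer, Integer> owner = new TreeMap<>();
        for (int p = 0; p < partitions; p++) owner.put(p, p % consumers);
        return owner; // partition -> owning consumer
    }

    // Share group: any consumer may take the next unacknowledged record
    // from any partition, so work is spread per record, not per partition.
    static int[] shareGroupDistribution(int records, int consumers) {
        int[] perConsumer = new int[consumers];
        for (int r = 0; r < records; r++) perConsumer[r % consumers]++;
        return perConsumer; // records handled by each consumer
    }

    public static void main(String[] args) {
        // 2 partitions, 4 consumers: a consumer group can use at most 2 of them...
        System.out.println(consumerGroupAssignment(2, 4).values()); // only consumers 0 and 1 appear
        // ...while a share group keeps all 4 busy, record by record.
        int[] spread = shareGroupDistribution(8, 4);
        System.out.println(java.util.Arrays.toString(spread)); // [2, 2, 2, 2]
    }
}
```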
==Kafka APIs==
===Connect API===
{{More sources section|date=May 2025}}
Kafka Connect (or Connect API) is a framework to import data from and export data to other systems.{{Cite web|title=Apache Kafka Documentation: Kafka Connect |website=Apache |url=https://kafka.apache.org/documentation/#connect |language=en}} It was added in the Kafka 0.9.0.0 release and uses the Producer and Consumer APIs internally. The Connect framework itself executes so-called "connectors" that implement the actual logic to read and write data from other systems. The Connect API defines the programming interface that must be implemented to build a custom connector. Many open-source and commercial connectors for popular data systems are already available. However, Apache Kafka itself does not include production-ready connectors.
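A connector is configured declaratively rather than programmatically. As a sketch, a standalone-mode worker could run the sample FileStreamSource connector that ships with Kafka using a properties file like the following (the connector name, file path, and topic name are placeholders):

```properties
# Standalone-mode connector configuration (illustrative values).
# FileStreamSource is bundled with Kafka as a sample connector and
# is not intended for production use.
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app/input.txt
topic=file-lines
```

With Kafka's stock scripts, such a file would typically be passed to `bin/connect-standalone.sh` along with the worker's own configuration; each line appended to the input file is then published as a record to the target topic.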
===Streams API===
Kafka Streams (or Streams API) is a stream-processing library written in Java. It was added in the Kafka 0.10.0.0 release. The library allows for the development of stateful stream-processing applications that are scalable, elastic, and fully fault-tolerant. The main API is a stream-processing domain-specific language (DSL) that offers high-level operators such as filter, map, grouping, windowing, aggregation, joins, and the notion of tables. Additionally, the Processor API can be used to implement custom operators for a lower-level development approach; the DSL and the Processor API can also be mixed. For stateful stream processing, Kafka Streams uses RocksDB to maintain local operator state. Because RocksDB can write to disk, the maintained state can be larger than the available main memory. For fault tolerance, all updates to local state stores are also written into a topic in the Kafka cluster, which allows the state to be recreated by reading those topics and feeding the data back into RocksDB.{{cite web |title=Kafka Connect – Import Export for Apache Kafka |url=https://softwaremill.com/import-export-through-kafka-connectors/ |access-date=2025-05-08 |website=SoftwareMill}}
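As an illustration, the commonly cited word-count example can be expressed in the DSL roughly as follows (a sketch assuming the kafka-streams library is on the classpath; topic names are placeholders):

```java
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

// Word-count topology in the Streams DSL (topic names are examples).
public class WordCountTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word) // re-key each record by the word itself
            .count();                     // state is kept in a local store (RocksDB),
                                          // backed by a changelog topic in the cluster
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }
}
```

The `count()` step is where the stateful behavior described above appears: its running totals live in a local state store, and every update is mirrored to a changelog topic so the store can be rebuilt after a failure.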
==See also==
{{Portal|Free and open-source software}}
==References==
{{Reflist|30em}}
==External links==
* {{Official website}}
{{Apache Software Foundation}}
{{Message-oriented middleware}}
{{Authority control}}
[[Category:Enterprise application integration]]
[[Category:Free software programmed in Scala]]
[[Category:Free software programmed in Java (programming language)]]
[[Category:Message-oriented middleware]]