Apache Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to data streams. Thousands of companies like Uber, Spotify, and Intuit are using Kafka in their tech stacks.
Exactly-once semantics was long a challenging problem for Apache Kafka, and it was often regarded as mathematically impossible. On 30 June 2017, Neha Narkhede, CTO of Confluent, published an article introducing exactly-once semantics in Kafka's 0.11 release.
Before deep-diving into exactly-once semantics, a basic understanding of types of semantics is required.
Kafka supports three types of message delivery semantics:
At-most-once: At-most-once delivery follows a send-and-forget approach. The producer sends each message only once and does not wait for an acknowledgment, so message loss is tolerated. Applications using at-most-once delivery can achieve higher throughput and lower latency, at the cost of possibly losing messages.
At-least-once: In at-least-once semantics, delivering a message more than once is acceptable, provided that no message is lost. The chance of losing a message is reduced to a minimum, but if the producer retries, the message is sent again and can end up duplicated. Applications using at-least-once semantics achieve moderate throughput and moderate latency.
Exactly-once: In exactly-once delivery semantics, each message is guaranteed to be delivered exactly once, with no duplicates and no data loss. It is the most complicated of the three delivery semantics. Applications using exactly-once semantics achieve lower throughput and higher latency.
What is exactly-once semantics in Kafka?
While producers are sending messages, an individual broker may fail or be cut off by a network failure, which causes the producer to retry. The consequence is message duplication.
With Kafka exactly-once semantics, this problem is resolved. Even if the producer retries, the message is delivered to the end consumer only once. Exactly-once ensures no data loss and no duplication. It is the strongest guarantee of the three semantics.
Why is exactly-once semantics important?
A producer has no way of knowing what kind of failure occurred on the broker side, so it has to retry, and retries can cause message duplication. There are two cases.
Case 1: The producer sends a message to the broker, but the broker fails (for example, because of a network failure) before the message is stored in the target stream. The producer receives no acknowledgment, so it retries. The new leader broker receives the message, stores it in the target stream, and sends the acknowledgment. This is the happy path, where no duplicates occur.
Case 2: The producer sends a message to the broker, and the broker receives it and stores it in the target stream, but then fails while sending the acknowledgment back to the producer. Because the acknowledgment never arrives, the producer has to retry. This results in duplicate messages persisting in the target stream.
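The two cases can be reproduced with a small in-memory model. This is only a sketch: the `Broker` class and `send_with_retries` function are hypothetical stand-ins written for this illustration, not part of any Kafka client, and the "failures" are simply counters that decide whether a request is lost before or after the write.

```python
class Broker:
    """Toy broker that appends every message it receives to a log."""

    def __init__(self, fail_before_store=0, fail_after_store=0):
        self.log = []
        # Number of upcoming requests that fail before / after the write.
        self.fail_before_store = fail_before_store
        self.fail_after_store = fail_after_store

    def append(self, message):
        """Return True only if an acknowledgment reaches the producer."""
        if self.fail_before_store > 0:
            self.fail_before_store -= 1
            return False  # Case 1: nothing stored, no acknowledgment
        self.log.append(message)
        if self.fail_after_store > 0:
            self.fail_after_store -= 1
            return False  # Case 2: stored, but the acknowledgment is lost
        return True


def send_with_retries(broker, message, max_attempts=5):
    """At-least-once producer: resend until an acknowledgment arrives."""
    for _ in range(max_attempts):
        if broker.append(message):
            return True
    return False


case1 = Broker(fail_before_store=1)
send_with_retries(case1, "order-42")
print(case1.log)  # ['order-42']

case2 = Broker(fail_after_store=1)
send_with_retries(case2, "order-42")
print(case2.log)  # ['order-42', 'order-42']
```

From the producer's point of view both cases look identical (no acknowledgment), which is why a plain retry cannot avoid the duplicate in Case 2.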
Exactly-once semantics ensures no data loss and no duplicate messages persist in the target stream.
Kafka exactly-once semantics enables a robust messaging system, maintains atomicity, and reduces data redundancy.
Terminology that will be used to understand exactly-once semantics
With the features added in Kafka's 0.11 release, exactly-once semantics can be guaranteed for the result of processing at each stage. To achieve exactly-once semantics, Kafka introduced two features in the 0.11 release:
- Idempotent Producer Guarantee: To keep the broker from storing an already persisted message more than once, a unique producer ID (PID) and a sequence number are attached to each message sent to the broker. The sequence number starts at zero and increases monotonically. The broker uses these two values to recognize and reject duplicate messages coming from the producer.
- Transactional Partition Guarantee: Transactions allow you to atomically write data to multiple topics and partitions, together with the offsets of the consumed messages. In a single processing step, messages are read from one or more source topics, the computation is performed, and the results are written to one or more target topics.
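The duplicate check behind the idempotent producer can be sketched as follows. This is a simplified model under stated assumptions: `IdempotentBroker` is a hypothetical class, it tracks one sequence number per producer ID only (a real broker also tracks this per partition), and the return strings are invented for the example.

```python
class IdempotentBroker:
    """Toy broker that remembers the last sequence number per producer ID."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id, seq, message):
        """Store the message unless this sequence number was already accepted."""
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"     # already persisted: do not store it again
        if seq != last + 1:
            return "out_of_order"  # a gap means an earlier message is missing
        self.log.append(message)
        self.last_seq[producer_id] = seq
        return "stored"


broker = IdempotentBroker()
results = [
    broker.append(producer_id=7, seq=0, message="a"),
    broker.append(producer_id=7, seq=1, message="b"),
    broker.append(producer_id=7, seq=1, message="b"),  # retry after a lost ack
    broker.append(producer_id=7, seq=3, message="d"),  # seq 2 never arrived
]
print(results)     # ['stored', 'stored', 'duplicate', 'out_of_order']
print(broker.log)  # ['a', 'b']
```

The retried message carries the same sequence number as the original, so the broker can acknowledge it without writing it a second time.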
How to implement exactly-once semantics in Kafka?
The two components required to achieve exactly-once semantics in Kafka are the idempotence guarantee and the transaction guarantee. With the help of the flowchart below, you can follow the steps needed to achieve exactly-once semantics in Kafka.
Step 1: To enable the idempotent producer feature, set enable.idempotence = true in the producer configuration. The producer's send operation is now idempotent. If an error causes the producer to retry, the same message, although sent by the producer multiple times, will be written to the Kafka log on the broker only once. We can optionally set transactional.id to a unique value, which is needed if the producer is going to use transactions. This unique ID is indispensable for providing continuity of transactional state across application restarts.
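As an illustration, the producer settings from Step 1 might look like the following, written here as a plain Python dictionary. The broker address and the `order-processor-1` ID are placeholder values, and `check_producer_config` is a hypothetical helper that only checks the two settings this step relies on; it is not part of any Kafka client.

```python
producer_config = {
    "bootstrap.servers": "localhost:9092",    # assumed broker address
    "enable.idempotence": True,               # retries no longer create duplicates
    "transactional.id": "order-processor-1",  # placeholder; must be unique per producer
}


def check_producer_config(config):
    """Return a list of settings from Step 1 that are missing or disabled."""
    problems = []
    if config.get("enable.idempotence") is not True:
        problems.append("enable.idempotence should be set to true")
    if not config.get("transactional.id"):
        problems.append("transactional.id should be set to use transactions")
    return problems


print(check_producer_config(producer_config))  # []
```

Keeping the transactional.id stable across restarts of the same application instance is what lets the broker connect the restarted producer to its earlier transactional state.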
Step 2: After enabling idempotence and setting transactional.id, call the producer.initTransactions() method. This call registers the producer with the broker, so that its transactions can be identified by transactional.id, sequence number, and epoch number.
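One way to picture the role of the epoch number is the toy model below. It assumes that each registration of the same transactional.id receives a higher epoch, so an older instance of the application (for example, one that hung and was replaced) can be told apart from the current one. `TransactionCoordinator` and its methods are hypothetical names for this sketch, not a client API.

```python
class TransactionCoordinator:
    """Toy coordinator that hands out an epoch per transactional.id."""

    def __init__(self):
        self.epochs = {}  # transactional.id -> current epoch

    def init_transactions(self, transactional_id):
        """Register a producer instance and bump the epoch for its ID."""
        epoch = self.epochs.get(transactional_id, -1) + 1
        self.epochs[transactional_id] = epoch
        return epoch

    def is_current(self, transactional_id, epoch):
        """Only the most recently registered instance may write."""
        return self.epochs.get(transactional_id) == epoch


coordinator = TransactionCoordinator()
old_epoch = coordinator.init_transactions("order-processor-1")
new_epoch = coordinator.init_transactions("order-processor-1")  # after a restart

print(coordinator.is_current("order-processor-1", old_epoch))  # False
print(coordinator.is_current("order-processor-1", new_epoch))  # True
```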
Step 3: On the consumer side, the isolation.level parameter must be set to achieve exactly-once semantics in Kafka. In the consumer configuration, isolation.level controls whether the consumer waits to read transactional messages until the corresponding transaction has been committed.
For isolation.level you have two options: read_committed and read_uncommitted.
- read_committed: The consumer reads the messages written in a transaction only after they have all been written and the transaction has been committed.
- read_uncommitted: The consumer reads all messages in offset order, without waiting for the transaction that contains them to be committed.
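The difference between the two options can be sketched with a toy log in which every record is tagged with the transaction that wrote it. The function `visible_messages` and the data layout are assumptions made for this example. The model also assumes that a read_committed consumer skips records from aborted transactions and does not read past the first transaction that is still open.

```python
def visible_messages(log, txn_status, isolation_level):
    """Return the values a consumer would see, in offset order.

    log: list of (txn_id, value); txn_id is None for non-transactional records.
    txn_status: dict txn_id -> "committed" or "aborted"; missing means still open.
    """
    if isolation_level == "read_uncommitted":
        return [value for _, value in log]
    if isolation_level != "read_committed":
        raise ValueError(f"unknown isolation.level: {isolation_level}")

    visible = []
    for txn_id, value in log:
        status = "committed" if txn_id is None else txn_status.get(txn_id, "open")
        if status == "open":
            break  # wait here until this transaction commits or aborts
        if status == "committed":
            visible.append(value)
        # records from aborted transactions are skipped
    return visible


log = [(None, "a"), ("t1", "b"), ("t2", "c"), ("t1", "d"), ("t3", "e"), (None, "f")]
txn_status = {"t1": "committed", "t2": "aborted"}  # t3 is still open

print(visible_messages(log, txn_status, "read_uncommitted"))
# ['a', 'b', 'c', 'd', 'e', 'f']
print(visible_messages(log, txn_status, "read_committed"))
# ['a', 'b', 'd']
```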
Step 4: With both the producer and the consumer configured to write and read transactionally, you can start sending messages by calling the producer.beginTransaction() method. This method checks that transactions were initialized beforehand.
Step 5: The application now consumes messages from the input topic, processes them, and produces the results back to Kafka, with the consumer offsets committed as part of the same transaction. The PID, epoch number, and transactional.id are used to complete this step. The producer calls the send() method to write the output messages, and in the Java client the consumed offsets are added to the transaction with sendOffsetsToTransaction().
Step 6: Once the consume-transform-commit loop has completed for all messages, the producer calls commitTransaction() or abortTransaction() to complete the transaction.
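Steps 4 to 6 can be summarized with an in-memory sketch. `ToyTransactionalProducer` is a hypothetical stand-in, not a Kafka client: it simply buffers output messages and consumed offsets and applies both together on commit, which is enough to show the all-or-nothing behavior. The method names mirror the steps above in Python style, and the `input-0` partition name is a placeholder.

```python
class ToyTransactionalProducer:
    """In-memory stand-in for a transactional producer (not a Kafka client)."""

    def __init__(self, output_topic, committed_offsets):
        self.output_topic = output_topic            # list acting as the target topic
        self.committed_offsets = committed_offsets  # dict: input partition -> next offset
        self.in_transaction = False
        self._pending_messages = []
        self._pending_offsets = {}

    def begin_transaction(self):
        if self.in_transaction:
            raise RuntimeError("transaction already in progress")
        self.in_transaction = True

    def send(self, message):
        self._require_transaction()
        self._pending_messages.append(message)

    def send_offsets(self, offsets):
        self._require_transaction()
        self._pending_offsets.update(offsets)

    def commit_transaction(self):
        self._require_transaction()
        # Output messages and consumed offsets become visible together.
        self.output_topic.extend(self._pending_messages)
        self.committed_offsets.update(self._pending_offsets)
        self._reset()

    def abort_transaction(self):
        self._require_transaction()
        self._reset()  # discard everything written since begin_transaction

    def _require_transaction(self):
        if not self.in_transaction:
            raise RuntimeError("no transaction in progress")

    def _reset(self):
        self._pending_messages = []
        self._pending_offsets = {}
        self.in_transaction = False


def process_batch(producer, batch, transform):
    """Consume-transform-produce for one batch of (offset, value) records."""
    producer.begin_transaction()
    try:
        for offset, value in batch:
            producer.send(transform(value))
            producer.send_offsets({"input-0": offset + 1})
        producer.commit_transaction()
        return True
    except Exception:
        producer.abort_transaction()
        return False


output, offsets = [], {}
producer = ToyTransactionalProducer(output, offsets)

ok = process_batch(producer, [(0, "a"), (1, "b")], str.upper)
failed = process_batch(producer, [(2, "c"), (3, None)], str.upper)  # None makes the transform fail

print(ok, failed)  # True False
print(output)      # ['A', 'B']
print(offsets)     # {'input-0': 2}
```

Because the second batch is aborted, neither its partial output nor its offsets are applied, so reprocessing it from offset 2 produces no duplicates.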
In this blog, we covered the types of delivery semantics, why exactly-once semantics is important, and how Kafka supports it with simple configuration and little coding effort. Kafka is used by large-scale businesses, for which exactly-once semantics is often a requirement.