# Kafka Guarantees Delivery, Not Uniqueness: How to Build Idempotent Systems
2026-01-28
## Understanding the Duplicate Message Scenario

If you've worked with Kafka long enough, you've probably seen this happen, or you will.
A producer sends a message.
The consumer processes it.
The database write succeeds.
And then… something goes wrong.
The acknowledgement doesn't come back.
The network hiccups.
The consumer restarts.
Kafka does what it's designed to do: it retries.
Suddenly, the same message shows up again.
Now you're left staring at duplicated rows, repeated updates, or inconsistent state, wondering:
"But didn't Kafka already process this?"

Here's the uncomfortable truth:
Kafka guarantees delivery, not uniqueness. Kafka is excellent at making sure messages are not lost. But when failures occur, and they always do in distributed systems, Kafka will retry. And retries mean duplicates, unless your system is designed to handle them. This is where many systems quietly break.
Not because Kafka failed.
But because the system assumed acknowledgements were reliable.

Let's walk through what's happening in the diagram above, step by step:

- The producer sends a message (Y) to the Kafka broker.
- The broker successfully appends this message to the topic partition. So far, everything is working as expected.
- However, when the broker sends the acknowledgement back, it fails to reach the producer, perhaps due to a temporary network issue or a timeout. From the producer's point of view, it has no way of knowing whether the message was actually written or not.
- So the producer does the only safe thing it can do: it retries and sends the same message (Y) again.
- The broker receives this retry and, without additional safeguards, appends the message again to the same partition. Now the topic contains two identical messages, even though the producer intended to send it only once.

This is an important realization:
The duplication happened not because Kafka is broken, but because the producer could not trust the acknowledgement. Kafka chose reliability over guessing. It preferred possibly duplicating a message rather than risking data loss. And that trade-off is intentional. This is exactly why retries are a fundamental part of Kafka, and why idempotency becomes essential when building real-world systems on top of it.

## A Simple Analogy

Think of Kafka like a courier service.
You send a package and wait for a confirmation.
If the confirmation doesn't arrive, you send the package again just to be safe.
From the courier's point of view, that's the correct behavior.
From the receiver's point of view, they may now have two identical packages. Kafka behaves the same way.
Retries are not a bug. They are a feature. The question is: can your system safely handle receiving the same message more than once?

## Enter Idempotency

This is where idempotency comes in. At a high level, an operation is idempotent if doing it multiple times produces the same final result as doing it once. In practice, that means:

- Processing the same event twice should not corrupt your data.
- Writing the same record again should not create duplicates.
- Retrying should be safe, not dangerous.

Kafka provides some idempotency guarantees at the producer level, which help prevent duplicate messages from being written to Kafka itself during retries. That's important, but it's only part of the story. Because even with an idempotent producer:

- consumers can retry
- acknowledgements can fail
- databases can be written to more than once

Which means true idempotency is not a single setting.
It's a system-wide design choice that spans the producer, the consumer, and the database itself. In this article, we'll walk through how idempotency actually works in real Kafka systems: what Kafka protects you from, what it doesn't, and how to design your pipeline so that retries don't turn into production incidents. No framework-specific code.
No marketing promises.
Just practical, production-oriented thinking.

## Producer-Side Idempotency: Preventing Duplicates at the Source

Let's start at the very beginning of the pipeline: the producer. When a producer sends a message to Kafka, it expects an acknowledgement in return. If that acknowledgement doesn't arrive, maybe due to a network glitch or a temporary broker issue, the producer assumes the message was not delivered and sends it again. From the producer's perspective, this is the safest possible behavior.
But without protection, this retry can result in duplicate messages being written to Kafka, even though the original message may have already been stored successfully. To handle this, Kafka provides producer-side idempotency.

## What Kafka's Idempotent Producer Actually Does

When producer idempotency is enabled, Kafka ensures that retries from the same producer do not result in duplicate records being written to a partition. Internally, Kafka does this by tracking:

- a unique identity for the producer session, and
- a sequence number for each message sent to a given partition

If the producer retries a message because it didn't receive an acknowledgement, Kafka can recognize that this is a retry of a previously sent message, not a new one, and it avoids writing it again. The result is simple and powerful:
Even if the producer retries, Kafka will store the message only once. This gives us a strong guarantee at the Kafka log level.

## Why Acknowledgements Matter (acks=all)

Producer idempotency works correctly only when Kafka is allowed to fully confirm writes.
That's why it's typically paired with waiting for acknowledgements from all in-sync replicas (acks=all). Why does this matter?
Because a partial acknowledgement can lie.
If the producer receives an acknowledgement before the message is safely replicated, and a failure happens immediately after, Kafka might accept the retry, and now you're back to duplicates or lost data. Waiting for full acknowledgements ensures that:

- Kafka has durably stored the message.
- retries are handled safely.
- producer idempotency can actually do its job.

Fast acknowledgements optimize latency. Strong acknowledgements protect correctness.

## The Critical Limitation (This Is Where Many Teams Stop Too Early)

At this point, it's tempting to think:
"Great producer idempotency is enabled. We're safe." Producer idempotency only guarantees that Kafka won't store duplicate records due to producer retries.
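The shape of that guarantee can be sketched with a toy, in-memory "broker" that deduplicates retried sends by producer ID and sequence number. This is purely an illustration of the idea, not Kafka's actual implementation; all names here are invented for the sketch.

```python
# Toy illustration of producer-side idempotency: the "broker" remembers the
# highest sequence number seen per (producer_id, partition) and ignores
# retries that replay an already-appended sequence.

class ToyBroker:
    def __init__(self):
        self.log = []        # records appended to the "partition"
        self.last_seq = {}   # (producer_id, partition) -> last sequence seen

    def append(self, producer_id, partition, seq, value):
        key = (producer_id, partition)
        if seq <= self.last_seq.get(key, -1):
            return "duplicate-ignored"   # a retry of an already-stored send
        self.last_seq[key] = seq
        self.log.append(value)
        return "appended"

broker = ToyBroker()
broker.append("p1", 0, 0, "Y")   # first attempt: stored
broker.append("p1", 0, 0, "Y")   # ack was lost, producer retried: ignored
assert broker.log == ["Y"]       # the message exists only once in the log
```

Note that a different producer session (say, `"p2"`) sending the same value would still be appended, which is exactly the limitation described next.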
It does not guarantee:

- uniqueness across different producers
- uniqueness across restarts
- uniqueness at the consumer or database level
- business-level correctness

If multiple producers send logically identical events, or if a consumer processes the same message twice, Kafka will not stop that. This is an important distinction:
Kafka-level idempotency protects delivery. It does not protect business state. And that's why real-world systems need more than just producer idempotency.

## Application-Level Idempotency: Making Duplicates Detectable

Once you accept that Kafka alone cannot guarantee uniqueness, the next question becomes:
How does the rest of the system recognize a duplicate when it sees one?

The answer is application-level idempotency. At this layer, we stop relying on Kafka to "do the right thing" and instead give our system the ability to identify whether an event has already been processed, regardless of how many times it shows up.

## The Core Idea: Stable Event Identity

Application-level idempotency starts with a simple but powerful concept:
Every logical event must have a stable, unique identity. This identity is not generated by Kafka.
It's generated by the application and travels with the event, end to end. Think of it like a receipt number.
If you see the same receipt number twice, you immediately know:

- this isn't a new action
- it's a retry or a duplicate
- processing it again would be incorrect

In Kafka systems, this typically means attaching an event ID to every message: something that uniquely represents what happened, not when it was sent.

## Why This Matters Even with Idempotent Producers

Producer idempotency prevents Kafka from writing the same send attempt twice.
But it cannot answer questions like:

- Did another producer emit the same logical event?
- Did this consumer restart and reprocess the message?
- Did a downstream write succeed even though the ack failed?

Only the application can answer those questions, and it can only do so if events are identifiable.
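One common way to get such an identity, sketched here with Python's standard library, is to derive a deterministic ID from the business fields of the event. The namespace and field names below are hypothetical.

```python
import uuid

# Derive a stable event ID from WHAT happened (order 42's payment was
# captured), not from when or how often the message was sent. Retries of
# the same logical event always produce the same ID.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "events.example.com")

def event_id(aggregate_id: str, action: str) -> str:
    return str(uuid.uuid5(NAMESPACE, f"{aggregate_id}:{action}"))

first_send  = event_id("order-42", "payment-captured")
retry_send  = event_id("order-42", "payment-captured")
other_event = event_id("order-43", "payment-captured")

assert first_send == retry_send      # duplicates are now recognizable
assert first_send != other_event     # distinct business events stay distinct
```

A name-based UUID is only one option; a natural business key (order ID plus action) carried in the message works just as well, as long as every hop preserves it.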
That's why application-level idempotency is about business correctness, not messaging mechanics.

## What Happens Without Stable Event IDs

Without a stable identifier, the system has no memory.
When a duplicate message arrives, the consumer has no way to know:

- whether this event is new
- whether it was already applied
- whether processing it again would cause harm

So the system does the only thing it can do: process it again.
This is how duplicates silently turn into:

- double inserts
- incorrect counters
- repeated state transitions
- corrupted aggregates

And by the time you notice, the damage is already done.

## With Application-Level Idempotency

When every event carries a stable ID, the system can make an informed decision.
At the consumer side, the flow becomes:

- Receive event.
- Check whether this event ID was already seen.
- If yes → skip or safely ignore.
- If no → process and record the ID.

Now retries stop being dangerous.
They become harmless repetitions.

## A Key Mindset Shift

This is the mental shift many teams miss:
Retries are inevitable. Duplicates are optional if your system can recognize them. Kafka will retry.
Networks will fail.
Consumers will restart. Application-level idempotency is how you design a system that remains correct anyway.

## A Reality Check: "Exactly Once" Is a Goal, Not a Guarantee

Before we talk about consumer-side idempotency, it's important to set expectations. In distributed systems, achieving 100% idempotency across all components is theoretically impossible.
This isn't a limitation of Kafka.
It's a property of distributed systems themselves, a world of independent processes, network partitions, and multiple sources of truth. There will always be edge cases where the system cannot know, with absolute certainty, whether an operation already happened or not. So when we talk about "exactly-once" behavior in Kafka-based systems, what we really mean is:
Practically exactly-once under well-defined failure scenarios. The goal is not perfection.
The goal is controlled correctness.

## Why This Matters

Many teams approach idempotency expecting a magic switch: a configuration that eliminates duplicates forever.
That switch does not exist. Instead, what Kafka and good system design give you is:

- deterministic behavior
- bounded failure modes
- safe retries
- and recoverable state

Idempotency is about minimizing harm, not eliminating retries.

## Kafka's Philosophy Aligns with This Reality

Kafka intentionally chooses:

- at-least-once delivery
- explicit retries
- clear failure semantics

Because losing data is usually worse than processing it twice.
This means Kafka pushes the final responsibility for correctness up to the application. That's not a weakness.
It's a design decision.

## Consumer-Side Idempotency: The Final Line of Defense

With that reality in mind, we now arrive at the most critical part of the system: the consumer. Even with:

- idempotent producers
- stable event IDs
- careful message design

consumers will still:

- reprocess messages
- see the same event more than once

Which means the consumer must assume:
"Every message I receive could be a duplicate."

Consumer-side idempotency is where this assumption is enforced.

## What the Consumer Must Do

At a high level, the consumer's job is simple:

- Receive an event.
- Check whether this event ID has already been processed.
- Decide whether to: apply the change, skip it, or safely update existing state.

This check typically happens before any irreversible side effects, especially database writes.
If the consumer does not perform this check, all previous idempotency efforts can still collapse at the last step.

## Why the Consumer Is So Important

The consumer is the only component that:

- sees the final event
- performs the side effect
- mutates durable state

That makes it the last opportunity to prevent duplicates from becoming permanent.
If duplicates reach the database unchecked, the system has already lost.

## How Consumers Enforce Idempotency in Practice

At the consumer layer, idempotency stops being a theory and becomes a decision-making process.
The consumer receives a message and must answer one question before doing anything else:
Have I already processed this event?

Everything else flows from that.

## The Two Common Deduplication Strategies

In practice, consumers enforce idempotency using one of two mechanisms, sometimes both.

## 1. Database-Based Deduplication (Most Reliable)

In this approach, the database itself becomes the source of truth for idempotency.
The idea is simple:

- every event has a stable event ID
- the database enforces uniqueness for that ID
- duplicate writes are either ignored or treated as no-ops

This works well because:

- databases are durable
- uniqueness constraints are enforced atomically
- retries become safe by design

From the consumer's point of view:

- if the write succeeds → the event was new
- if the write fails due to duplication → the event was already processed

The key benefit here is correctness under crashes.
Even if:

- the consumer restarts
- the same message is processed again
- the acknowledgement failed previously

…the database prevents corruption.
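Here is a minimal sketch of that pattern using Python's built-in SQLite driver. The table and event names are hypothetical, but the shape, a uniqueness constraint plus an insert-or-ignore write, carries over to most relational databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_events (
        event_id TEXT PRIMARY KEY,   -- uniqueness enforced atomically by the DB
        payload  TEXT NOT NULL
    )
""")

def handle(event_id: str, payload: str) -> str:
    # INSERT OR IGNORE is a no-op when event_id already exists, so a
    # retried message cannot create a second row.
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO processed_events (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )
    conn.commit()
    return "processed" if conn.total_changes > before else "duplicate-skipped"

assert handle("evt-001", "order-42 paid") == "processed"          # first delivery
assert handle("evt-001", "order-42 paid") == "duplicate-skipped"  # retry
```

In PostgreSQL the equivalent would be `INSERT ... ON CONFLICT (event_id) DO NOTHING`; the important part is that the check and the write are a single atomic statement.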
That's why database-level idempotency is often the strongest safety net in Kafka systems.

## 2. Cache-Based Deduplication (Fast but Weaker)

Some systems use an in-memory cache or distributed cache to track processed event IDs.
This approach is typically chosen for:

- very high throughput
- extremely low latency
- short-lived deduplication windows

The flow looks like this:

- consumer checks cache for event ID
- if present → skip
- if not → process and store ID in cache

This can work well, but it comes with trade-offs:

- cache entries expire
- cache can be evicted
- cache can be lost on failure

Which means:
Cache-based deduplication improves performance, but cannot be the only line of defense if correctness is critical. Many production systems use cache as an optimization, with the database still acting as the final authority. There is no universal right answer.
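The trade-off is easy to demonstrate with a minimal sketch: a plain dict with timestamps stands in for a real cache such as Redis, and the TTL value is deliberately tiny for illustration.

```python
import time

class TtlDedupCache:
    """Short-term duplicate filter. Entries expire, so this alone cannot
    guarantee correctness: a durable store must back it up."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> time the ID was recorded

    def is_duplicate(self, event_id: str) -> bool:
        now = time.monotonic()
        # Drop expired entries (a real cache does this eviction for us).
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return True
        self.seen[event_id] = now
        return False

cache = TtlDedupCache(ttl_seconds=0.05)
assert cache.is_duplicate("evt-1") is False   # first time: process it
assert cache.is_duplicate("evt-1") is True    # fast retry: filtered out
time.sleep(0.06)                              # the entry expires...
assert cache.is_duplicate("evt-1") is False   # ...and a late duplicate slips through
```

The last assertion is the whole argument: once the entry has expired, the cache happily treats a duplicate as new, which is why the database must remain the final authority.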
## Choosing Between Them (or Combining Them)

The choice depends on:

- how harmful duplicates are
- how long duplicates can appear
- how much latency you can tolerate
- how much complexity you're willing to manage

A common pattern is to combine them:

- cache for fast, short-term duplicate filtering
- database for long-term correctness

This balances performance and safety.

## The Important Ordering Rule

One subtle but critical rule applies regardless of strategy:
Deduplication must happen before side effects.

Once the consumer performs an irreversible action, such as writing to a database or triggering an external call, it's already too late to ask whether the event was a duplicate.
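The difference is easy to see in a sketch; `send_email` below is a hypothetical irreversible side effect, and both handlers are invented for illustration.

```python
# Two consumer handlers that differ only in WHERE the dedup check sits.
emails_sent = []
processed_ids = set()

def send_email(event_id):           # an irreversible side effect
    emails_sent.append(event_id)

def handle_wrong(event_id):
    send_email(event_id)            # side effect first...
    if event_id in processed_ids:   # ...so the check comes too late
        return "duplicate"
    processed_ids.add(event_id)
    return "processed"

def handle_right(event_id):
    if event_id in processed_ids:   # check BEFORE any side effect
        return "duplicate"
    processed_ids.add(event_id)
    send_email(event_id)
    return "processed"

handle_wrong("evt-1"); handle_wrong("evt-1")   # delivery plus retry
assert emails_sent.count("evt-1") == 2         # the user got two emails

handle_right("evt-2"); handle_right("evt-2")   # delivery plus retry
assert emails_sent.count("evt-2") == 1         # the retry was harmless
```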
This is why idempotency checks are placed at the very start of message processing.

## Why This Still Doesn't Mean "Perfect Idempotency"

Even with all of this in place, edge cases still exist.
There are moments where:

- a write succeeds
- the consumer crashes
- the acknowledgement never happens
- the message is retried

At that point, the system relies entirely on idempotency to remain correct.
And this brings us back to the earlier reality check:
Idempotency doesn't eliminate retries. It makes retries safe.

That's the real objective.

## Conclusion: Idempotency Is a System Property, Not a Feature

Kafka is very good at one thing: making sure data is not lost.
It is intentionally not responsible for ensuring that data is processed only once everywhere. That responsibility belongs to the system built on top of Kafka, not Kafka itself. The approaches discussed in this article represent one practical way to achieve idempotency in real-world systems, but they are not the only way. Depending on the ecosystem and tooling you use, there may be other mechanisms available:

- language-level abstractions
- framework-provided annotations
- transactional helpers
- or platform-specific guarantees

These can simplify implementation, but they do not change the underlying requirement:
the system must still be designed to tolerate retries and detect duplicates. Frameworks can help. They cannot replace sound system design. And that's the key takeaway.
Idempotency is not:

- a Kafka configuration
- a producer setting
- a consumer option
- or a database trick

It is a system-wide design decision.

## What We've Learned

Let's zoom out and connect the dots.

- Kafka retries are expected, not exceptional.
- Acknowledgements can fail even when writes succeed.
- Producer idempotency prevents duplicate writes to Kafka, not duplicate business effects.
- Application-level event identity makes duplicates detectable.
- Consumer-side idempotency is the final line of defense.
- Databases and caches enforce correctness when retries happen.
- "Exactly-once" is a practical goal, not a mathematical guarantee.

Or put simply:
Kafka guarantees delivery. Your system must guarantee correctness.

## Why This Matters in Production

In small systems, duplicates might look harmless.
In large systems, with high throughput, retries, restarts, and partial failures, duplicates silently accumulate and eventually surface as:

- incorrect data
- broken invariants
- painful backfills
- and long debugging sessions

Idempotency is cheaper than recovery. Systems that are designed to tolerate retries age far better than systems that assume they won't happen.

## The Right Mental Model Going Forward

When designing Kafka-based systems, ask these questions early:

- What uniquely identifies a business event?
- What happens if this message is processed twice?
- Where is duplication detected?
- Where is correctness enforced?
- What happens when acknowledgements lie?

If you can answer those clearly, your system is already ahead of most. Kafka will retry. Failures will happen. Duplicates will appear. Idempotency is how you make all of that safe.

## 🔗 Connect with Me

📖 Blog by Naresh B. A.
👨💻 Building AI & ML Systems | Backend-Focused Full Stack
🌐 Portfolio: Naresh B A
📫 Let's connect on LinkedIn | GitHub: Naresh B A

Thanks for spending your precious time reading this; it's a personal little corner of my thoughts, and I really appreciate you being here. ❤️