Kafka Reference: Topics & Concepts Cheat Sheet
1. Kafka Fundamentals
What is Kafka? Use Cases & Design Goals
Apache Kafka is a distributed event streaming platform used for building real-time data pipelines and streaming apps. It's designed for high throughput, low latency, and scalability.
Common use cases:
- Log aggregation
- Real-time analytics
- Event sourcing
- Messaging between microservices
- Stream processing (e.g., fraud detection)
Design goals:
- Durable (disk-based storage)
- Scalable (horizontally)
- Fault-tolerant
- High performance
Kafka High-Level Architecture
Brokers
Kafka brokers are servers that store data and serve client requests. Each broker handles a subset of partitions.
Example: If you have 3 brokers and 12 partitions, each broker may manage 4 partitions.
Topics & Partitions
A topic is a named stream of data. It is split into partitions, which allow for parallelism.
Example:
Topic: orders
Partitions: [0, 1, 2]
Each partition stores an ordered log of messages.
Producers & Consumers
- Producers publish messages to topics.
- Consumers subscribe to topics and read messages.
Example:
producer.send(new ProducerRecord<>("orders", "order-123"));
ZooKeeper vs KRaft Mode
- ZooKeeper: Older architecture where ZooKeeper handles cluster metadata and leadership.
- KRaft (Kafka Raft Metadata mode): Kafka-native metadata management (no ZooKeeper), simplifying operations.
Kafka Controller
A special broker elected as the controller manages partition leadership, broker registrations, and reassignments.
Kafka's Storage Model: Log-Based Messaging
Each partition is an append-only log, and every message gets a unique offset. Consumers track offsets to read messages.
Example:
Partition 0 log:
Offset 0: order-123
Offset 1: order-124
Consumers can replay or rewind messages by resetting offsets.
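As a small sketch of what a rewind looks like with the standard Java client (the orders topic is the one from the example above; the consumer is assumed to be already configured), seeking to the beginning of the assigned partitions replays the log:
consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Rewind to offset 0 on every assigned partition to replay the log
        consumer.seekToBeginning(partitions);
    }
});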
What Makes Kafka Fast?
Zero-Copy
Kafka uses zero-copy to transfer data from disk to network buffer directly, reducing CPU overhead.
Sequential Disk Writes
Kafka writes to disk sequentially, which is much faster than random writes.
Why? It minimizes seek time and improves disk I/O efficiency.
Page Cache
Kafka benefits from the OS's page cache, allowing frequently accessed data to be served from memory.
Compression
Kafka supports compression (e.g., Snappy, GZIP), reducing network and disk usage.
Example:
props.put("compression.type", "snappy");
Batching
Producers and consumers batch multiple records together to reduce overhead and improve throughput.
Example:
props.put("batch.size", 16384);
props.put("linger.ms", 10);
2. Core Concepts & Data Flow
Producer-Consumer Pattern
Kafka follows a publish-subscribe model:
- Producers publish messages to a topic.
- Consumers subscribe to one or more topics and process messages.
Example (Java):
ProducerRecord<String, String> record = new ProducerRecord<>("orders", "order-123");
producer.send(record);
This loose coupling allows producers and consumers to evolve independently.
Topic Partitioning
Each topic is divided into multiple partitions.
Parallelism & Throughput
More partitions = more parallelism. Kafka assigns partitions across brokers to spread load.
Example: With 4 partitions and 4 consumers, each consumer can read from a different partition in parallel.
Ordering Guarantees
Kafka guarantees order only within a single partition, not across partitions.
Tip: To maintain order by key (e.g. user ID), use a partitioning key.
producer.send(new ProducerRecord<>("payments", "user-42", "payment-100"));
Consumer Groups
A consumer group allows multiple consumers to read from the same topic in parallel, with each partition assigned to only one consumer in the group.
Load Balancing
Partitions are automatically distributed across consumers in the group.
Rebalancing
When a consumer joins or leaves the group, Kafka rebalances partition assignments, possibly causing short downtime.
High Availability
If one consumer fails, Kafka reassigns its partitions to others, ensuring fault tolerance.
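A minimal consumer-group sketch with the standard Java client (the group and topic names are illustrative; process() is an application-specific placeholder):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "order-processors"); // all instances sharing this ID split the partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("orders"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // application-specific handling
    }
}
Running a second copy of this program with the same group.id makes Kafka split the topic's partitions between the two instances.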
Offset Management
Kafka tracks which messages a consumer has read using offsets.
Auto vs Manual Commit
- Auto commit: Kafka commits the offset automatically at a configured interval.
- Manual commit: The consumer explicitly commits offsets after successful processing.
Example:
consumer.commitSync(); // Manual commit after processing
Committed vs Uncommitted Offsets
- Committed offsets: Stored in Kafka (or external store); recovery point after crash.
- Uncommitted offsets: In memory, may be lost on failure.
Delivery Guarantees
Kafka supports three main delivery semantics:
At-Least-Once
Default mode. Messages are never lost, but may be duplicated if consumer crashes before committing.
At-Most-Once
Offsets are committed before processing, so messages may be lost on failure.
Used when duplicates are worse than loss (e.g. sensor metrics).
Exactly-Once Semantics (EOS)
No loss, no duplication: hard to achieve, but Kafka supports it with special features.
Idempotent Producers
Prevents duplicate records during retries. Kafka assigns a producer ID (PID) and sequence numbers.
Example:
props.put("enable.idempotence", "true");
This allows safe retries without duplicates.
Transactions
Kafka transactions allow multiple writes across topics/partitions to be committed atomically.
Example:
producer.initTransactions();
producer.beginTransaction();
producer.send(...);
producer.send(...);
producer.commitTransaction();
Used in exactly-once pipelines or stream joins.
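A slightly fuller sketch (the transactional.id and topic names are illustrative) that includes the abort path a real pipeline needs:
props.put("transactional.id", "order-service-tx-1"); // required to use transactions
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("orders", "order-123", "created"));
    producer.send(new ProducerRecord<>("payments", "order-123", "pending"));
    producer.commitTransaction(); // both records become visible atomically
} catch (KafkaException e) {
    producer.abortTransaction(); // read_committed consumers never see the partial writes
}
Downstream consumers must set isolation.level=read_committed to skip records from aborted transactions.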
3. Reliability, Retention, and Cleanup
This section focuses on how Kafka ensures durability, handles cleanup, and deals with error scenarios to maintain long-term performance and reliability.
Retention Policies
Kafka stores messages on disk for a configurable amount of time or size, regardless of whether they are consumed.
- Time-based retention: Keep data for X hours/days.
- Size-based retention: Keep data until log reaches X GB.
Example:
retention.ms=604800000 # 7 days
retention.bytes=1073741824 # 1 GB
This supports replayability and long-term auditing.
Log Compaction
Used for topics where only the latest value per key is important (like user profiles or balances).
Kafka retains only the most recent record for each key.
Example:
Before compaction:
user-42 → v1
user-42 → v2
user-43 → v1
After compaction:
user-42 → v2
user-43 → v1
Enable with:
cleanup.policy=compact
Dead Letter Queue (DLQ)
If a consumer cannot process a message (e.g., malformed JSON, downstream error), it can forward it to a dead letter topic for later review.
Common pattern:
- Process message.
- On failure, send to orders.DLQ.
This improves reliability by not blocking the pipeline on bad data.
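A hedged sketch of the pattern inside a consumer loop (process() and dlqProducer are application-specific placeholders):
try {
    process(record);
} catch (Exception e) {
    // Forward the bad record, with error context in a header, to the dead letter topic
    ProducerRecord<String, String> dlqRecord =
        new ProducerRecord<>("orders.DLQ", record.key(), record.value());
    dlqRecord.headers().add("error", String.valueOf(e.getMessage()).getBytes(StandardCharsets.UTF_8));
    dlqProducer.send(dlqRecord);
}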
Replication & In-Sync Replicas (ISR)
Kafka replicates each partition across multiple brokers for fault tolerance.
- Leader replica: Handles reads and writes.
- Follower replicas: Sync data from leader.
- In-Sync Replicas (ISR): Followers fully caught up with the leader.
Example:
Topic: orders
Replication Factor: 3
Partition 0 → Broker 1 (leader), Broker 2, Broker 3 (followers)
If the leader fails, another replica from the ISR takes over.
Replica Placement & Tuning Tips
- Use replication factor ≥ 3 in production.
- Monitor under-replicated partitions as a health check.
- Tune min.insync.replicas=2.
Combined with acks=all, this ensures writes only succeed if at least two replicas are in sync, which is critical for high durability.
4. Processing & Patterns
This section dives into common Kafka design patterns used in microservices, event-driven architecture, and stream processing systems.
Event Sourcing
In event sourcing, state changes are captured as an immutable sequence of events.
Kafka acts as a source of truth by storing a timeline of what happened.
Example:
UserRegistered → EmailUpdated → PasswordChanged
Instead of storing current state (e.g., User object), you store all events that led to that state.
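A minimal sketch (the user-events topic name is illustrative): keying every event by the aggregate ID keeps one user's history ordered within a single partition.
producer.send(new ProducerRecord<>("user-events", "user-42", "UserRegistered"));
producer.send(new ProducerRecord<>("user-events", "user-42", "EmailUpdated"));
producer.send(new ProducerRecord<>("user-events", "user-42", "PasswordChanged"));
Consumers then rebuild the current state by replaying these events from the start of the partition.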
CQRS with Kafka
CQRS (Command Query Responsibility Segregation) separates writes (commands) from reads (queries).
- Write side publishes events to Kafka.
- Read side listens to topics and builds queryable views (e.g., Elasticsearch or a database).
Example:
POST /placeOrder → produce → order_placed topic
order_placed → consumed by read service → materialize view
Change Data Capture (CDC)
CDC captures database changes (insert/update/delete) and streams them as Kafka events.
Tools like Debezium and Kafka Connect can track changes in MySQL, Postgres, etc.
Example:
UPDATE users SET email = 'a@b.com'
→ emits event to the user_updated topic
This is useful for syncing data across services or databases.
Outbox Pattern
Solves the dual-write problem: writes to the database and to Kafka must be atomic.
How it works:
- Application writes to DB and an outbox table in the same transaction.
- A connector reads the outbox table and publishes events to Kafka.
Why? Ensures consistency between DB state and emitted events.
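A hedged JDBC sketch of the write path (table and column names are illustrative, not tied to any specific framework):
connection.setAutoCommit(false);
try (PreparedStatement order = connection.prepareStatement(
         "INSERT INTO orders(id, status) VALUES (?, ?)");
     PreparedStatement outbox = connection.prepareStatement(
         "INSERT INTO outbox(aggregate_id, event_type, payload) VALUES (?, ?, ?)")) {
    order.setString(1, "order-123");
    order.setString(2, "PLACED");
    order.executeUpdate();
    outbox.setString(1, "order-123");
    outbox.setString(2, "order_placed");
    outbox.setString(3, "{\"orderId\":\"order-123\"}");
    outbox.executeUpdate();
    connection.commit(); // both rows commit atomically; a connector (e.g., Debezium) then publishes the outbox row to Kafka
}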
Idempotent Consumers
Consumers should handle retries without causing duplicates or inconsistencies.
Approaches:
- Use a unique message key or ID.
- Deduplicate by keeping track of processed event IDs.
Example:
if (!processedEventIds.contains(event.id)) {
process(event);
markAsProcessed(event.id);
}
Message Deduplication
Kafka does not deduplicate by default: if a producer retries, duplicates may occur.
Solutions:
- Enable idempotent producers (enable.idempotence=true)
- Use deduplication logic at the consumer side
- Store the latest processed offset or event ID
Saga Pattern (Distributed Transactions)
In distributed systems, sagas coordinate multiple steps using compensating actions instead of ACID transactions.
Kafka can orchestrate these via choreography or orchestration:
- Choreography: Services listen for events and trigger next steps.
- Orchestration: A central saga orchestrator sends commands/events.
Example:
order_created → reserve_inventory → payment_processed → order_confirmed
If payment fails → emit event → revert_inventory
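A choreography sketch under illustrative assumptions (a payment_failed topic carrying the failure event, and releaseReservation() as the inventory service's compensating action):
consumer.subscribe(Arrays.asList("payment_processed", "payment_failed"));
while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
        if (record.topic().equals("payment_failed")) {
            releaseReservation(record.key()); // compensating action in the inventory service
            producer.send(new ProducerRecord<>("revert_inventory", record.key(), record.value()));
        }
    }
}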
5. Streaming & Transformation
This section explains how to process Kafka data in real time using the Kafka Streams API and ksqlDB, enabling powerful stream processing directly within the Kafka ecosystem.
Kafka Streams API
Kafka Streams is a Java library for building real-time applications that process Kafka data directly from topics.
Key features:
- Stateless and stateful operations (map, filter, join, window)
- Exactly-once support
- Scales with Kafka partitions
Example: Count orders per user
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> orders = builder.stream("orders"); // record key assumed to be the user ID
KTable<String, Long> counts = orders
    .groupByKey()
    .count();
counts.toStream().to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));
KTable, KStream, Joins, Windowing
Kafka Streams introduces two core abstractions:
- KStream: A stream of events (like log lines)
- KTable: A changelog stream representing evolving state (like DB table)
Common operations:
- Joins: Enrich a stream using another stream/table.
- Windowing: Group events by time windows (e.g. count per 5 minutes).
Example:
KStream<String, String> payments = ...
KStream<String, String> shipments = ...
payments.join(shipments, ...); // Join on order ID
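A fuller join sketch (assuming both streams are keyed by order ID and a recent Kafka Streams version that provides JoinWindows.ofTimeDifferenceWithNoGrace):
KStream<String, String> enriched = payments.join(
    shipments,
    (payment, shipment) -> payment + " | " + shipment, // ValueJoiner: combine both sides
    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)), // pair events arriving within 10 minutes
    StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));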
Kafka Streams vs Kafka Consumer
| Kafka Consumer | Kafka Streams |
|---|---|
| Low-level record access | High-level DSL for stream processing |
| Manual offset management | Automatic offset tracking |
| No stateful operations | Built-in state store support |
| More flexible | More powerful for real-time processing |
Use Kafka Streams if you need aggregation, joins, or windowed logic.
ksqlDB: SQL for Real-Time Streams
ksqlDB lets you write SQL-like queries over Kafka topics, which is great for non-Java users or rapid prototyping.
Example:
CREATE STREAM orders (id STRING, user_id STRING, total DOUBLE)
WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
SELECT user_id, COUNT(*) FROM orders WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY user_id EMIT CHANGES;
It runs as a Kafka-native service and can produce output topics, making stream processing declarative.
6. Schemas & Serialization
This section explains how Kafka handles data format and schema evolution across producers and consumers, which is critical for maintaining compatibility and avoiding downstream failures.
Avro, Protobuf, JSON
Kafka messages are just byte arrays; serialization defines how data is encoded.
Common formats:
- JSON: Easy to read, but no schema enforcement.
- Avro (by Apache): Compact binary + rich schema support.
- Protobuf (by Google): Fast and schema-based, similar to Avro.
Example (Avro schema):
{
"type": "record",
"name": "User",
"fields": [
{ "name": "id", "type": "string" },
{ "name": "email", "type": "string" }
]
}
In code:
ProducerRecord<String, User> record = new ProducerRecord<>("users", user);
Schema Evolution
Schema evolution allows you to change your data model over time without breaking consumers.
Avro and Protobuf allow for:
- Adding/removing optional fields
- Setting default values
- Managing backward/forward compatibility
Use Case:
- Add a new field phone_number with default = "", so old consumers won't break.
Confluent Schema Registry
The Schema Registry stores and validates schemas centrally. It ensures that all messages on a topic conform to a registered schema.
Benefits:
- Enforces schema compatibility rules
- Enables automatic deserialization
- Works with Avro, Protobuf, JSON Schema
How it works:
- Producer registers the schema with Schema Registry.
- Consumer fetches schema by ID (embedded in message).
- Deserialization happens automatically.
Example config:
schema.registry.url=http://localhost:8081
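A hedged producer configuration sketch (assumes the Confluent Avro serializer dependency on the classpath and a generated Avro User class):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081"); // the serializer registers/fetches schemas here
KafkaProducer<String, User> producer = new KafkaProducer<>(props);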
7. Kafka Ecosystem & Integration
Kafka is more than just producers and consumers: its ecosystem includes powerful tools for data integration, replication, and API access, making it a complete event streaming platform.
Kafka Connect
Kafka Connect is a framework for connecting Kafka to external systems (like databases, cloud storage, etc.) using pre-built source and sink connectors.
Example:
- Source connector: Pulls data from MySQL → Kafka topic
- Sink connector: Pushes data from Kafka → Elasticsearch
Deployment options:
- Standalone (simple setups)
- Distributed (fault-tolerant, scalable)
Config example:
{
"name": "mysql-source",
"connector.class": "MySqlConnector",
"topics": "users"
}
Single Message Transforms (SMTs)
SMTs are lightweight transformations applied in Kafka Connect pipelines, useful for modifying records without writing custom code.
Examples:
- Add or remove fields
- Rename keys
- Mask sensitive data
Example SMT:
"transforms": "MaskSSN",
"transforms.MaskSSN.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.MaskSSN.fields": "ssn"
Kafka REST Proxy
The REST Proxy allows applications that can't use native Kafka clients (e.g., browsers, scripts) to interact with Kafka over HTTP.
Use Cases:
- Serverless apps
- Frontend apps
- Lightweight tools
Example (REST Proxy v2 produce request):
POST /topics/orders
Content-Type: application/vnd.kafka.json.v2+json
{
  "records": [
    { "key": "order-1", "value": "shipped" }
  ]
}
It supports producing, consuming, and checking metadata (like topic list).
Kafka MirrorMaker 2.0
MirrorMaker 2.0 replicates topics between Kafka clusters, often used for:
- Multi-data center replication
- Cloud-to-on-prem migrations
- Disaster recovery setups
Features:
- Replicates configs, ACLs, and offsets
- Based on Kafka Connect
- Supports topic renaming and filtering
Example (connect-mirror-maker.properties):
clusters = primary, backup
primary->backup.enabled = true
primary->backup.topics = .*
By default, replicated topics are prefixed with the source cluster alias (e.g., primary.orders).
8. Performance Optimization
This section dives into tuning Kafka for high throughput, low latency, and reliable delivery, which is crucial for production systems operating at scale.
Throughput vs Latency
Kafka performance tuning is often a trade-off between maximum throughput and minimal latency.
Key producer settings:
- batch.size: Max size (bytes) of a batch before sending.
- linger.ms: How long to wait before sending a batch (even if not full).
- compression.type: Compress batches (snappy, gzip, zstd).
- acks: Required number of acknowledgments (0, 1, or all).
Example (Java):
props.put("batch.size", 16384);
props.put("linger.ms", 5);
props.put("compression.type", "snappy");
props.put("acks", "all");
Backpressure Handling
Backpressure occurs when consumers are slower than producers.
Strategies:
- Use bounded queues in consumers
- Apply rate limiting on producers
- Monitor and scale consumer group size
Tip: Monitor consumer lag metrics to detect pressure.
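One concrete way to apply backpressure inside a consumer (workQueue is an illustrative bounded queue; pause/resume are part of the standard Java consumer API):
if (workQueue.remainingCapacity() == 0) {
    consumer.pause(consumer.assignment()); // keep polling (and heartbeating) without fetching more records
} else {
    consumer.resume(consumer.assignment());
}
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));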
Producer Retries
Retries are important, but risky without idempotence.
- retries: Number of retry attempts (e.g., 5)
- max.in.flight.requests.per.connection: Controls how many sends can be in-flight
Enable idempotent producer for safe retries:
props.put("enable.idempotence", "true");
Rebalancing Strategies
By default, Kafka reassigns partitions when consumer group membership changes.
Newer assignors are smarter and less disruptive:
- range (default): Simple but can be unbalanced
- roundrobin: Better spread
- cooperative-sticky (recommended): Minimizes partition movement
Example config:
props.put("partition.assignment.strategy", "CooperativeStickyAssignor");
This helps reduce downtime during scale-up or rolling restarts.
9. Security & Access Control
Kafka supports enterprise-grade security through encryption, authentication, and authorization, helping you protect sensitive data in transit and control who can do what.
Encryption: SSL/TLS
Kafka supports TLS encryption for:
- Client ↔ broker traffic
- Broker ↔ broker traffic
Benefits:
- Protects data in transit
- Prevents man-in-the-middle (MITM) attacks
Example config (server):
ssl.keystore.location=/etc/kafka/server.keystore.jks
ssl.keystore.password=secret
security.inter.broker.protocol=SSL
Clients must trust the broker's certificate.
Authentication: SASL
Kafka uses SASL (Simple Authentication and Security Layer) to authenticate users and services.
Common mechanisms:
- PLAIN: Username + password
- SCRAM: Secure password hashing
- GSSAPI: Kerberos
- OAUTHBEARER: Token-based auth
Example (SASL/PLAIN config):
sasl.mechanism=PLAIN
security.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required
username="admin" password="admin-secret";
Authorization: ACLs (Access Control Lists)
Kafka can restrict who can read/write/admin using ACLs.
Granularity:
- Per-topic
- Per-consumer group
- Per-cluster
Example (CLI):
kafka-acls --bootstrap-server localhost:9092 --add --allow-principal User:app1 --operation Read --topic payments
This allows app1 to consume from the payments topic.
Multi-Tenant Kafka Setups
Kafka supports multi-tenancy by using:
- Namespaced topics (e.g., team1.orders, team2.orders)
- Role-based ACLs
- Quotas (to limit bandwidth or connections)
Useful for shared environments (e.g., platform teams or large orgs).
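A hedged example of setting a quota with the stock tooling (the entity name and limits are illustrative):
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152' \
  --entity-type clients --entity-name team1-app
This caps the named client at roughly 1 MB/s produce and 2 MB/s consume.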
10. Monitoring & Observability
To run Kafka reliably at scale, you need visibility into its health, performance, and data flow. This section outlines how to monitor Kafka effectively.
Key Metrics to Monitor
Kafka exposes many metrics via JMX. Here are the critical ones to watch:
Lag Metrics
- Consumer Lag: Difference between last produced and last consumed message.
- Indicates if consumers are falling behind.
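For a quick lag check without a full monitoring stack, the stock CLI reports per-partition lag (the group name is illustrative):
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group order-processors
The LAG column is the difference between the log end offset and the group's committed offset.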
Broker Metrics
- Under-Replicated Partitions: Partitions missing ISR replicas.
- Request Latency: Time to serve produce/fetch requests.
- Disk Usage: Broker log sizes; may impact retention and compaction.
Producer Metrics
- Record send rate
- Error rate
- Retries, latency
Consumer Metrics
- Fetch rate
- Commit rate
- Lag per partition
Monitoring Tools
Kafka doesn't ship with a UI, but integrates well with these tools:
Prometheus + Grafana
- Use JMX Exporter to expose Kafka metrics.
- Grafana dashboards offer great visual insight.
Confluent Control Center
- GUI for Kafka monitoring (licensed).
- Tracks lag, throughput, errors, schema usage, etc.
Others
- Datadog, Splunk, New Relic, Elastic Stack
- Cruise Control (for partition rebalancing and optimization)
Logging & Audit Trails
Kafka uses log4j for logging. Customize log levels in the broker's log4j configuration (log4j.properties).
- Log retention can be tuned for auditing.
- Security-related logs (e.g., ACL failures) are crucial for detecting misuse.
Example:
log4j.logger.kafka.authorizer.logger=DEBUG, authorizerAppender
11. Operations & Deployment
This section focuses on how to configure, deploy, and manage Kafka clusters in real-world environments, ensuring stability, resilience, and scalability.
Deploying Kafka (On-Prem vs Cloud)
Kafka can run in:
- On-premise data centers (bare metal or VMs)
- Public clouds (AWS, GCP, Azure)
- Managed services:
  - Confluent Cloud
  - Amazon MSK
  - Azure Event Hubs for Kafka
Trade-offs:
- Managed services reduce ops overhead but may cost more.
- On-prem gives full control but requires more setup/monitoring.
Configuration Tuning
Key configs to optimize performance and reliability:
Brokers:
num.network.threads=3
num.io.threads=8
log.retention.hours=168
log.segment.bytes=1073741824
default.replication.factor=3
Producers:
acks=all
retries=5
linger.ms=10
batch.size=16384
Consumers:
enable.auto.commit=false
max.poll.records=500
session.timeout.ms=10000
Backup & Disaster Recovery
Kafka doesn't have native backup, but you can use:
- MirrorMaker 2.0: Cluster replication
- Tiered storage (with vendors like Confluent)
- Snapshotting consumers' output (e.g., store materialized views in DB)
Best practice:
- Replication factor ≥ 3
- Use rack-awareness for better failure isolation
Kafka in Kubernetes
Kafka can run well on Kubernetes using operators:
- Strimzi (Apache-licensed, widely used)
- Confluent Operator (enterprise offering)
Features:
- Automated scaling, upgrades, TLS, secrets
- Helm chart support
- Kafka Connect, MirrorMaker, and Schema Registry support
Note: Running Kafka on Kubernetes requires good disk and network planning (Kafka prefers fast persistent storage).
Kafka as a Database?
Kafka can act as a system of record (event store), but it's not a queryable DB.
Strengths:
- Immutable logs
- Replayability
- Time-travel
Limitations:
- No secondary indexes
- No complex queries
Workaround: Combine with Kafka Streams or ksqlDB for stateful processing and querying.
12. Advanced Concepts & Patterns
This section explores advanced Kafka use cases and powerful patterns that go beyond basic messaging, enabling reprocessing, analytics, and modern data architectures.
Replaying Events for Reprocessing
Kafka's log retention allows consumers to re-read old messages, which is useful for:
- Rebuilding state after a bug fix
- Reprocessing data with new logic
- Rehydrating a downstream system
How?
- Manually reset consumer offsets:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group my-app --topic orders --execute
Using Kafka with Flink or Spark Streaming
Kafka integrates well with stream processing frameworks:
- Apache Flink: Advanced event time processing, exactly-once, CEP.
- Apache Spark Structured Streaming: Batch + stream hybrid.
Example (Spark):
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "orders") \
    .load()
Use when Kafka Streams isn't flexible enough (e.g., ML pipelines, large windowed joins).
Kafka in Serverless & Microservices
Kafka fits naturally into event-driven microservice architectures.
Serverless use cases:
- Triggering AWS Lambda with Kafka via MSK or EventBridge
- Decoupling systems without direct REST calls
Benefits:
- Resilience
- Asynchronous flow
- Observability via event logs
Kafka for ML Pipelines / Feature Stores
Kafka can serve as the backbone for real-time ML systems:
- Ingest real-time data (e.g., user clicks, fraud signals)
- Stream processing for feature extraction
- Store to feature store or model input pipelines
Pattern:
Raw Events → Kafka → Feature Stream → Feature Store / Model Inference
Works especially well with Flink or Kafka Streams + TensorFlow, PyTorch, etc.
Event-Driven Architecture Patterns
Kafka supports:
- Event Notification: Lightweight events trigger actions.
- Event-Carried State Transfer: Full state embedded in event (for denormalized systems).
- Event Sourcing + CQRS: Reliable history of state changes and projections.
Tip: Choose based on your system's need for consistency, scalability, and auditability.
13. Most Common Kafka Issues and Solutions
This section covers frequent problems developers and operators encounter with Kafka, and offers actionable solutions and best practices to resolve them quickly.
1. High Consumer Lag
Symptoms:
- Consumer is not keeping up with the producer
- Growing lag visible in monitoring tools
Causes:
- Slow processing logic
- Not enough consumers in the group
- Unbalanced partition assignment
Solutions:
- Optimize message processing (e.g., use async/parallelism)
- Scale consumer instances
- Use max.poll.records and fetch.max.bytes wisely
- Monitor and alert on lag metrics
2. Consumer Rebalance Storms
Symptoms:
- Frequent rebalancing
- Consumers constantly being kicked out and rejoining
Causes:
- Long processing without calling poll()
- Flaky consumer instances
- Using old assignment strategies
Solutions:
- Increase session.timeout.ms or tune heartbeat.interval.ms properly
- Use CooperativeStickyAssignor instead of range or roundrobin
- Ensure poll() is called within the allowed interval
- Avoid killing consumers too aggressively (e.g., via autoscaling)
3. Messages Not Being Delivered
Symptoms:
- Messages produced but not consumed
- Empty topic on consumer side
Causes:
- Consumer group offset not committed or stuck
- Wrong topic or partition config
- Topic retention expired the data
Solutions:
- Check if the consumer is part of the correct group
- Reset offsets using kafka-consumer-groups.sh
- Verify retention settings (retention.ms)
- Confirm the producer is writing to the correct topic
4. Under-Replicated Partitions
Symptoms:
- Partitions missing followers in ISR (In-Sync Replica)
- Broker disk I/O spikes
Causes:
- Slow disk or network
- Broker down or overloaded
Solutions:
- Add more brokers or redistribute partitions
- Check disk performance and broker logs
- Set alerting on under_replicated_partitions
5. Authorization Failures / ACL Errors
Symptoms:
- Clients get "not authorized" errors
- Producers or consumers cannot connect
Causes:
- ACLs misconfigured or missing
- SASL/SSL not correctly set up
Solutions:
- Use kafka-acls.sh to review and add necessary permissions
- Double-check client credentials and broker security config
- Log and monitor kafka.authorizer.logger
6. Duplicate Messages
Symptoms:
- Same message processed multiple times
- Downstream system shows double entries
Causes:
- Retries without idempotent producer
- At-least-once delivery without deduplication
Solutions:
- Enable idempotence: enable.idempotence=true
- Use message keys for idempotency (e.g., order ID)
- Implement deduplication logic in the consumer
7. Disk Space Full
Symptoms:
- Kafka broker crashes or stops accepting writes
- OS alerts for low disk space
Causes:
- Retention settings too generous
- Compaction not cleaning up as expected
Solutions:
- Tune retention.ms, retention.bytes, or log.segment.bytes
- Check compaction and cleanup policies
- Use log.dirs across multiple mount points
8. Offset Commit Failures
Symptoms:
- Exceptions in consumer logs related to commit
- Messages processed multiple times after restart
Causes:
- Using auto-commit with long processing
- Commit offset before processing (at-most-once)
Solutions:
- Use manual commits after processing
- Use commitSync() or commitAsync() with error handling
- Log and monitor commit failures
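A hedged sketch of a commit with error handling (assumes a configured Java consumer and an SLF4J-style logger named log):
try {
    consumer.commitSync(); // commit after processing, so a crash re-delivers rather than loses messages
} catch (CommitFailedException e) {
    // Usually means the group rebalanced, e.g., processing exceeded max.poll.interval.ms
    log.warn("Offset commit failed, records may be redelivered", e);
}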
9. Serialization / Deserialization Errors
Symptoms:
- Exceptions during produce or consume
- Invalid format or corrupted data
Causes:
- Mismatched schema between producer and consumer
- Schema evolution breaking compatibility
- Corrupt messages
Solutions:
- Use Schema Registry to enforce compatibility
- Validate all changes to schema before deployment
- Add error handling to your serializers/deserializers