

Kafka Reference: Topics & Concepts Cheat Sheet


🧠 1. Kafka Fundamentals


πŸ“Œ What is Kafka? Use Cases & Design Goals

Apache Kafka is a distributed event streaming platform used for building real-time data pipelines and streaming apps. It’s designed for high throughput, low latency, and scalability.

Common use cases:

  • Log aggregation
  • Real-time analytics
  • Event sourcing
  • Messaging between microservices
  • Stream processing (e.g., fraud detection)

Design goals:

  • Durable (disk-based storage)
  • Scalable (horizontally)
  • Fault-tolerant
  • High performance

🧱 Kafka High-Level Architecture

🧩 Brokers

Kafka brokers are servers that store data and serve client requests. Each broker handles a subset of partitions.

Example: If you have 3 brokers and 12 partitions, each broker may manage 4 partitions.

πŸ“¦ Topics & Partitions

A topic is a named stream of data. It is split into partitions, which allow for parallelism.

Example:

Topic: orders
Partitions: [0, 1, 2]

Each partition stores an ordered log of messages.

✍️ Producers & Consumers

  • Producers publish messages to topics.
  • Consumers subscribe to topics and read messages.

Example:

producer.send(new ProducerRecord<>("orders", "order-123"));
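
A minimal consumer sketch to complement the producer call above (the group ID and topic name are illustrative):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "order-service");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("orders"));
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100)); // fetch a batch of messages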

πŸ—‚οΈ ZooKeeper vs KRaft Mode

  • ZooKeeper: Older architecture where ZooKeeper handles cluster metadata and leadership.
  • KRaft (Kafka Raft Metadata mode): Kafka-native metadata management (no ZooKeeper), simplifying operations.

🧠 Kafka Controller

A special broker elected as the controller manages partition leadership, broker registrations, and reassignments.


πŸͺ΅ Kafka’s Storage Model: Log-Based Messaging

Each partition is an append-only log, and every message gets a unique offset. Consumers track offsets to read messages.

Example:

Partition 0 log:
Offset 0: order-123
Offset 1: order-124

Consumers can replay or rewind messages by resetting offsets.
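
In code, a hedged sketch using manual partition assignment (topic and partition are illustrative):

TopicPartition p0 = new TopicPartition("orders", 0);
consumer.assign(List.of(p0));
consumer.seekToBeginning(List.of(p0)); // rewind to the earliest offset; takes effect on the next poll()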


⚑ What Makes Kafka Fast?

πŸ’Ύ Zero-Copy

Kafka uses zero-copy to transfer data from disk to network buffer directly, reducing CPU overhead.

πŸ” Sequential Disk Writes

Kafka writes to disk sequentially, which is much faster than random writes.

Why? It minimizes seek time and improves disk I/O efficiency.

πŸ“‚ Page Cache

Kafka benefits from the OS's page cache, allowing frequently accessed data to be served from memory.

πŸ—œοΈ Compression

Kafka supports compression (e.g., Snappy, GZIP), reducing network and disk usage.

Example:

props.put("compression.type", "snappy");

πŸ“¦ Batching

Producers and consumers batch multiple records together to reduce overhead and improve throughput.

Example:

props.put("batch.size", 16384);
props.put("linger.ms", 10);

πŸ”„ 2. Core Concepts & Data Flow


πŸ” Producer–Consumer Pattern

Kafka follows a publish–subscribe model:

  • Producers publish messages to a topic.
  • Consumers subscribe to one or more topics and process messages.

Example (Java):

ProducerRecord<String, String> record = new ProducerRecord<>("orders", "order-123");
producer.send(record);

This loose coupling allows producers and consumers to evolve independently.


🧱 Topic Partitioning

Each topic is divided into multiple partitions.

βš™οΈ Parallelism & Throughput

More partitions = more parallelism. Kafka assigns partitions across brokers to spread load.

Example: With 4 partitions and 4 consumers, each consumer can read from a different partition in parallel.

πŸ”’ Ordering Guarantees

Kafka guarantees order only within a single partition, not across partitions.

Tip: To maintain order per key (e.g. user ID), set a message key; all records with the same key go to the same partition.

producer.send(new ProducerRecord<>("payments", "user-42", "payment-100"));

πŸ‘₯ Consumer Groups

A consumer group allows multiple consumers to read from the same topic in parallel, with each partition assigned to only one consumer in the group.

βš–οΈ Load Balancing

Partitions are automatically distributed across consumers in the group.

πŸ”„ Rebalancing

When a consumer joins or leaves the group, Kafka rebalances partition assignments, possibly causing short downtime.

πŸ†™ High Availability

If one consumer fails, Kafka reassigns its partitions to others, ensuring fault tolerance.


πŸ“Œ Offset Management

Kafka tracks which messages a consumer has read using offsets.

πŸ” Auto vs Manual Commit

  • Auto commit: Kafka commits the offset automatically at a configured interval.
  • Manual commit: The consumer explicitly commits offsets after successful processing.

Example:

consumer.commitSync(); // Manual commit after processing

πŸ“ Committed vs Uncommitted Offsets

  • Committed offsets: Stored in Kafka (or external store); recovery point after crash.
  • Uncommitted offsets: In memory, may be lost on failure.

🚚 Delivery Guarantees

Kafka supports three main delivery semantics:

βœ… At-Least-Once

Default mode. Messages are not lost, but they may be duplicated if the consumer crashes before committing its offset.

⚠️ At-Most-Once

Offsets are committed before processing, so messages may be lost on failure.

Used when duplicates are worse than loss (e.g. sensor metrics).

πŸ”’ Exactly-Once Semantics (EOS)

No loss and no duplication. Harder to achieve, but Kafka supports it through idempotent producers and transactions (covered below).


πŸ†” Idempotent Producers

Prevents duplicate records during retries. Kafka assigns a producer ID (PID) and sequence numbers.

Example:

props.put("enable.idempotence", "true");

This allows safe retries without duplicates.

πŸ” Transactions

Kafka transactions allow multiple writes across topics/partitions to be committed atomically.

Example:

props.put("transactional.id", "orders-tx-1"); // required before initTransactions(); value is illustrative

producer.initTransactions();
producer.beginTransaction();
producer.send(...);
producer.send(...);
producer.commitTransaction();

Used in exactly-once pipelines or stream joins.


πŸ›‘οΈ 3. Reliability, Retention, and Cleanup

This section focuses on how Kafka ensures durability, handles cleanup, and deals with error scenarios to maintain long-term performance and reliability.


πŸ“† Retention Policies

Kafka stores messages on disk for a configurable time or up to a configurable size, regardless of whether they have been consumed.

  • Time-based retention: Keep data for X hours/days.
  • Size-based retention: Keep data until log reaches X GB.

Example:

retention.ms=604800000  # 7 days
retention.bytes=1073741824  # 1 GB

This supports replayability and long-term auditing.


πŸ”„ Log Compaction

Used for topics where only the latest value per key is important (like user profiles or balances).

Kafka retains only the most recent record for each key.

Example:

Before compaction:

user-42 β†’ v1  
user-42 β†’ v2  
user-43 β†’ v1  

After compaction:

user-42 β†’ v2  
user-43 β†’ v1

Enable with:

cleanup.policy=compact

☠️ Dead Letter Queue (DLQ)

If a consumer cannot process a message (e.g., malformed JSON, downstream error), it can forward it to a dead letter topic for later review.

Common pattern:

  • Process message.
  • On failure, send to orders.DLQ.

This improves reliability by not blocking the pipeline on bad data.
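
A minimal sketch of the pattern (process() and dlqProducer are assumed application components; Kafka has no built-in DLQ for plain consumers, so the application does the forwarding):

try {
    process(record); // application-specific handling
} catch (Exception e) {
    // Forward the failed record to a dead letter topic for later inspection
    dlqProducer.send(new ProducerRecord<>("orders.DLQ", record.key(), record.value()));
}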


πŸ›‘ Replication & In-Sync Replicas (ISR)

Kafka replicates each partition across multiple brokers for fault tolerance.

  • Leader replica: Handles reads and writes.
  • Follower replicas: Sync data from leader.
  • In-Sync Replicas (ISR): Followers fully caught up with the leader.

Example:

Topic: orders  
Replication Factor: 3  
Partition 0 β†’ Broker 1 (leader), Broker 2, Broker 3 (followers)

If the leader fails, another replica from the ISR takes over.


⚠️ Replica Placement & Tuning Tips

  • Use replication factor β‰₯ 3 in production.

  • Monitor under-replicated partitions as a health check.

  • Tune:

    min.insync.replicas=2
    

Combined with acks=all, this ensures writes only succeed if at least two replicas are in sync, which is critical for high durability.


πŸ” 4. Processing & Patterns

This section dives into common Kafka design patterns used in microservices, event-driven architecture, and stream processing systems.


🧾 Event Sourcing

In event sourcing, state changes are captured as an immutable sequence of events.

Kafka acts as a source of truth by storing a timeline of what happened.

Example:

UserRegistered β†’ EmailUpdated β†’ PasswordChanged

Instead of storing current state (e.g., User object), you store all events that led to that state.


βš”οΈ CQRS with Kafka

CQRS (Command Query Responsibility Segregation) separates writes (commands) from reads (queries).

  • Write side publishes events to Kafka.
  • Read side listens to topics and builds queryable views (e.g., Elasticsearch or a database).

Example:

POST /placeOrder β†’ produce β†’ order_placed topic  
order_placed β†’ consumed by read service β†’ materialize view

πŸ”„ Change Data Capture (CDC)

CDC captures database changes (insert/update/delete) and streams them as Kafka events.

Tools like Debezium and Kafka Connect can track changes in MySQL, Postgres, etc.

Example:

  • UPDATE users SET email = 'a@b.com' β†’ emits event β†’ user_updated topic

This is useful for syncing data across services or databases.


πŸ“¦ Outbox Pattern

Solves the dual-write problem β€” when writing to DB and Kafka must be atomic.

How it works:

  1. Application writes to DB and an outbox table in the same transaction.
  2. A connector reads the outbox table and publishes events to Kafka.

Why? Ensures consistency between DB state and emitted events.
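
A rough JDBC sketch of step 1, assuming an open Connection and illustrative table/column names:

connection.setAutoCommit(false);
try {
    try (PreparedStatement order = connection.prepareStatement(
             "INSERT INTO orders (id, status) VALUES (?, ?)");
         PreparedStatement outbox = connection.prepareStatement(
             "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)")) {
        order.setString(1, "order-123");
        order.setString(2, "PLACED");
        order.executeUpdate();

        outbox.setString(1, "order-123");
        outbox.setString(2, "OrderPlaced");
        outbox.setString(3, "{\"orderId\":\"order-123\",\"status\":\"PLACED\"}");
        outbox.executeUpdate();
    }
    connection.commit();   // both rows are written atomically
} catch (SQLException e) {
    connection.rollback(); // neither row is written
    throw e;
}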


βœ… Idempotent Consumers

Consumers should handle retries without causing duplicates or inconsistencies.

Approaches:

  • Use a unique message key or ID.
  • Deduplicate by keeping track of processed event IDs.

Example:

if (!processedEventIds.contains(event.id)) {
   process(event);
   markAsProcessed(event.id);
}

🧼 Message Deduplication

Kafka does not deduplicate by default β€” if a producer retries, duplicates may occur.

Solutions:

  • Enable idempotent producers (enable.idempotence=true)
  • Use deduplication logic at the consumer side
  • Store the latest processed offset or event ID

🧡 Saga Pattern (Distributed Transactions)

In distributed systems, sagas coordinate multiple steps using compensating actions instead of ACID transactions.

Kafka can orchestrate these via choreography or orchestration:

  • Choreography: Services listen for events and trigger next steps.
  • Orchestration: A central saga orchestrator sends commands/events.

Example:

order_created β†’ reserve_inventory β†’ payment_processed β†’ order_confirmed  
If payment fails β†’ emit event β†’ revert_inventory

πŸ”€ 5. Streaming & Transformation

This section explains how to process Kafka data in real time using the Kafka Streams API and ksqlDB, enabling powerful stream processing directly within the Kafka ecosystem.


🌊 Kafka Streams API

Kafka Streams is a Java library for building real-time applications that process Kafka data directly from topics.

Key features:

  • Stateless and stateful operations (map, filter, join, window)
  • Exactly-once support
  • Scales with Kafka partitions

Example: Count orders per user

KStream<String, String> orders = builder.stream("orders"); // record key = user ID
KTable<String, Long> counts = orders
    .groupByKey()
    .count();
counts.toStream().to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

πŸ”„ KTable, KStream, Joins, Windowing

Kafka Streams introduces two core abstractions:

  • KStream: A stream of events (like log lines)
  • KTable: A changelog stream representing evolving state (like DB table)

Common operations:

  • Joins: Enrich a stream using another stream/table.
  • Windowing: Group events by time windows (e.g. count per 5 minutes).

Example:

KStream<String, String> payments = ...
KStream<String, String> shipments = ...
payments.join(shipments, ...);  // Join on order ID
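
Windowing, sketched on the same payments stream (the 5-minute tumbling window is illustrative):

payments.groupByKey()
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
        .count(); // events per key per 5-minute window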

βš–οΈ Kafka Streams vs Kafka Consumer

| Kafka Consumer | Kafka Streams |
| --- | --- |
| Low-level record access | High-level DSL for stream processing |
| Manual offset management | Automatic offset tracking |
| No stateful operations | Built-in state store support |
| More flexible | More powerful for real-time processing |

Use Kafka Streams if you need aggregation, joins, or windowed logic.


🧠 ksqlDB: SQL for Real-Time Streams

ksqlDB lets you write SQL-like queries over Kafka topics β€” great for non-Java users or rapid prototyping.

Example:

CREATE STREAM orders (id STRING, user_id STRING, total DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');

SELECT user_id, COUNT(*) FROM orders WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY user_id EMIT CHANGES;

It runs as a Kafka-native service and can produce output topics, making stream processing declarative.


🧬 6. Schemas & Serialization

This section explains how Kafka handles data format and schema evolution across producers and consumers, which is critical for maintaining compatibility and avoiding downstream failures.


πŸ“¦ Avro, Protobuf, JSON

Kafka messages are just byte arrays β€” serialization defines how data is encoded.

Common formats:

  • JSON: Easy to read, but no schema enforcement.
  • Avro (by Apache): Compact binary + rich schema support.
  • Protobuf (by Google): Fast and schema-based, similar to Avro.

Example (Avro schema):

{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "email", "type": "string" }
  ]
}

In code:

ProducerRecord<String, User> record = new ProducerRecord<>("users", user);

πŸ”„ Schema Evolution

Schema evolution allows you to change your data model over time without breaking consumers.

Avro and Protobuf allow for:

  • Adding/removing optional fields
  • Setting default values
  • Managing backward/forward compatibility

Use Case:

  • Add a new field phone_number with default = "" β†’ old consumers won’t break.

🧠 Confluent Schema Registry

The Schema Registry stores and validates schemas centrally. It ensures that all messages on a topic conform to a registered schema.

Benefits:

  • Enforces schema compatibility rules
  • Enables automatic deserialization
  • Works with Avro, Protobuf, JSON Schema

How it works:

  1. Producer registers the schema with Schema Registry.
  2. Consumer fetches schema by ID (embedded in message).
  3. Deserialization happens automatically.

Example config:

schema.registry.url=http://localhost:8081
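
A fuller producer-side sketch using Confluent's Avro serializer (class names come from the Confluent serializer library; the URL is illustrative):

key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://localhost:8081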

πŸ”— 7. Kafka Ecosystem & Integration

Kafka is more than just producers and consumers β€” its ecosystem includes powerful tools for data integration, replication, and API access, making it a complete event streaming platform.


πŸ”Œ Kafka Connect

Kafka Connect is a framework for connecting Kafka to external systems (like databases, cloud storage, etc.) using pre-built source and sink connectors.

Example:

  • Source connector: Pulls data from MySQL β†’ Kafka topic
  • Sink connector: Pushes data from Kafka β†’ Elasticsearch

Deployment options:

  • Standalone (simple setups)
  • Distributed (fault-tolerant, scalable)

Config example:

{
  "name": "mysql-source",
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "database.hostname": "localhost",
  "table.include.list": "mydb.users"
}

πŸ”„ Single Message Transforms (SMTs)

SMTs are lightweight transformations applied in Kafka Connect pipelines β€” useful for modifying records without writing custom code.

Examples:

  • Add or remove fields
  • Rename keys
  • Mask sensitive data

Example SMT:

"transforms": "MaskSSN",
"transforms.MaskSSN.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.MaskSSN.fields": "ssn"

🌐 Kafka REST Proxy

The REST Proxy allows applications that can’t use native Kafka clients (e.g., browsers, scripts) to interact with Kafka over HTTP.

Use Cases:

  • Serverless apps
  • Frontend apps
  • Lightweight tools

Example:

POST /topics/orders
Content-Type: application/vnd.kafka.json.v2+json

{
  "records": [
    { "key": "order-1", "value": "shipped" }
  ]
}

It supports producing, consuming, and checking metadata (like topic list).


🌍 Kafka MirrorMaker 2.0

MirrorMaker 2.0 replicates topics between Kafka clusters, often used for:

  • Multi-data center replication
  • Cloud-to-on-prem migrations
  • Disaster recovery setups

Features:

  • Replicates configs, ACLs, and offsets
  • Based on Kafka Connect
  • Supports topic renaming and filtering

Example (connect-mirror-maker.properties):

clusters = primary, backup
primary->backup.enabled = true
primary->backup.topics = .*

With the default replication policy, replicated topics are prefixed with the source cluster alias (orders becomes primary.orders on the target).

πŸš€ 8. Performance Optimization

This section dives into tuning Kafka for high throughput, low latency, and reliable delivery β€” crucial for production systems operating at scale.


βš–οΈ Throughput vs Latency

Kafka performance tuning is often a trade-off between maximum throughput and minimal latency.

πŸ”§ Key Producer Settings:

  • batch.size: Max size (bytes) of a batch before sending.
  • linger.ms: How long to wait before sending a batch (even if not full).
  • compression.type: Compress batches (snappy, gzip, zstd).
  • acks: Required number of acknowledgments (0, 1, or all).

Example (Java):

props.put("batch.size", 16384);
props.put("linger.ms", 5);
props.put("compression.type", "snappy");
props.put("acks", "all");

πŸ›‘ Backpressure Handling

Backpressure occurs when consumers are slower than producers.

Strategies:

  • Use bounded queues in consumers
  • Apply rate limiting on producers
  • Monitor and scale consumer group size

Tip: Monitor consumer lag metrics to detect pressure.
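
One way to apply backpressure inside a consumer, sketched with an assumed in-memory work queue:

// Pause fetching while the local queue is full; resume once it drains
if (workQueue.size() >= MAX_QUEUE_SIZE) {
    consumer.pause(consumer.assignment());
} else if (workQueue.isEmpty() && !consumer.paused().isEmpty()) {
    consumer.resume(consumer.paused());
}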


πŸ“€ Producer Retries

Retries are important, but risky without idempotence.

  • retries: Number of retry attempts (e.g., 5)
  • max.in.flight.requests.per.connection: Controls how many sends can be in-flight

Enable idempotent producer for safe retries:

props.put("enable.idempotence", "true");

βš™οΈ Rebalancing Strategies

By default, Kafka reassigns partitions when consumer group membership changes.

Newer assignors are smarter and less disruptive:

  • range (default): Simple but can be unbalanced
  • roundrobin: Better spread
  • cooperative-sticky (recommended): Minimizes partition movement

Example config:

props.put("partition.assignment.strategy", "CooperativeStickyAssignor");

This helps reduce downtime during scale-up or rolling restarts.


πŸ” 9. Security & Access Control

Kafka supports enterprise-grade security through encryption, authentication, and authorization, helping you protect sensitive data in transit and control who can do what.


πŸ”’ Encryption: SSL/TLS

Kafka supports TLS encryption for:

  • Client ↔ Broker
  • Broker ↔ Broker

Benefits:

  • Protects data in transit
  • Prevents man-in-the-middle (MITM) attacks

Example config (server):

ssl.keystore.location=/etc/kafka/server.keystore.jks
ssl.keystore.password=secret
security.inter.broker.protocol=SSL

Clients must trust the broker's certificate.
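
A minimal client-side sketch of that trust configuration (paths and passwords are illustrative):

security.protocol=SSL
ssl.truststore.location=/etc/kafka/client.truststore.jks
ssl.truststore.password=secret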


πŸ‘€ Authentication: SASL

Kafka uses SASL (Simple Authentication and Security Layer) to authenticate users and services.

Common mechanisms:

  • PLAIN: Username + password
  • SCRAM: Secure password hashing
  • GSSAPI: Kerberos
  • OAUTHBEARER: Token-based auth

Example (SASL/PLAIN config):

sasl.mechanism=PLAIN
security.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="admin" password="admin-secret";

πŸ›‚ Authorization: ACLs (Access Control Lists)

Kafka can restrict who can read/write/admin using ACLs.

Granularity:

  • Per-topic
  • Per-consumer group
  • Per-cluster

Example (CLI):

kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:app1 --operation Read --topic payments

This allows app1 to read from the payments topic (consuming as part of a group also requires Read on that consumer group).


🏒 Multi-Tenant Kafka Setups

Kafka supports multi-tenancy by using:

  • Namespaced topics (e.g., team1.orders, team2.orders)
  • Role-based ACLs
  • Quotas (to limit bandwidth or connections)

Useful for shared environments (e.g., platform teams or large orgs).
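
Quotas can be set with the built-in tooling; a sketch with an illustrative client ID and limits:

kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152' \
  --entity-type clients --entity-name team1-app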


πŸ“ˆ 10. Monitoring & Observability

To run Kafka reliably at scale, you need visibility into its health, performance, and data flow. This section outlines how to monitor Kafka effectively.


πŸ§ͺ Key Metrics to Monitor

Kafka exposes many metrics via JMX. Here are the critical ones to watch:

πŸ”„ Lag Metrics

  • Consumer Lag: Difference between last produced and last consumed message.

    • Indicates if consumers are falling behind.

βš™οΈ Broker Metrics

  • Under-Replicated Partitions: Partitions missing ISR replicas.
  • Request Latency: Time to serve produce/fetch requests.
  • Disk Usage: Broker log sizes β€” may impact retention and compaction.

πŸ“Š Producer Metrics

  • Record send rate
  • Error rate
  • Retries, latency

πŸ‘₯ Consumer Metrics

  • Fetch rate
  • Commit rate
  • Lag per partition

πŸ› οΈ Monitoring Tools

Kafka doesn’t ship with a UI, but integrates well with these tools:

πŸ“‘ Prometheus + Grafana

  • Use JMX Exporter to expose Kafka metrics.
  • Grafana dashboards offer great visual insight.

πŸ“ˆ Confluent Control Center

  • GUI for Kafka monitoring (licensed).
  • Tracks lag, throughput, errors, schema usage, etc.

πŸ“‹ Others

  • Datadog, Splunk, New Relic, Elastic Stack
  • Cruise Control (for partition rebalancing and optimization)

🧾 Logging & Audit Trails

Kafka uses log4j for logging. Customize log levels in the broker's log4j configuration (config/log4j.properties).

  • Log retention can be tuned for auditing.
  • Security-related logs (e.g., ACL failures) are crucial for detecting misuse.

Example:

log4j.logger.kafka.authorizer.logger=DEBUG, authorizerAppender

πŸ› οΈ 11. Operations & Deployment

This section focuses on how to configure, deploy, and manage Kafka clusters in real-world environments, ensuring stability, resilience, and scalability.


☁️ Deploying Kafka (On-Prem vs Cloud)

Kafka can run in:

  • On-premise data centers (bare metal or VMs)

  • Public clouds (AWS, GCP, Azure)

  • Managed services:

    • Confluent Cloud
    • Amazon MSK
    • Azure Event Hubs for Kafka

Trade-offs:

  • Managed services reduce ops overhead but may cost more.
  • On-prem gives full control but requires more setup/monitoring.

βš™οΈ Configuration Tuning

Key configs to optimize performance and reliability:

Brokers:

num.network.threads=3
num.io.threads=8
log.retention.hours=168
log.segment.bytes=1073741824
default.replication.factor=3

Producers:

acks=all
retries=5
linger.ms=10
batch.size=16384

Consumers:

enable.auto.commit=false
max.poll.records=500
session.timeout.ms=10000

♻️ Backup & Disaster Recovery

Kafka doesn’t have native backup, but you can use:

  • MirrorMaker 2.0: Cluster replication
  • Tiered storage (with vendors like Confluent)
  • Snapshotting consumers' output (e.g., store materialized views in DB)

Best practice:

  • Replication factor β‰₯ 3
  • Use rack-awareness for better failure isolation

☸️ Kafka in Kubernetes

Kafka can run well on Kubernetes using operators:

  • Strimzi (Apache-licensed, widely used)
  • Confluent Operator (enterprise offering)

Features:

  • Automated scaling, upgrades, TLS, secrets
  • Helm chart support
  • Kafka Connect, MirrorMaker, and Schema Registry support

Note: Running Kafka on Kubernetes requires good disk and network planning (Kafka prefers fast persistent storage).


🧾 Kafka as a Database?

Kafka can act as a system of record (event store), but it’s not a queryable DB.

Strengths:

  • Immutable logs
  • Replayability
  • Time-travel

Limitations:

  • No secondary indexes
  • No complex queries

Workaround: Combine with Kafka Streams or ksqlDB for stateful processing and querying.


🧠 12. Advanced Concepts & Patterns

This section explores advanced Kafka use cases and powerful patterns that go beyond basic messaging, enabling reprocessing, analytics, and modern data architectures.


πŸ” Replaying Events for Reprocessing

Kafka's log retention allows consumers to re-read old messages β€” useful for:

  • Rebuilding state after a bug fix
  • Reprocessing data with new logic
  • Rehydrating a downstream system

How?

  • Manually reset consumer offsets:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group my-app --topic orders --execute

πŸ§ͺ Using Kafka with Flink or Spark Streaming

Kafka integrates well with stream processing frameworks:

  • Apache Flink: Advanced event time processing, exactly-once, CEP.
  • Apache Spark Structured Streaming: Batch + stream hybrid.

Example (Spark):

df = spark.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "orders") \
  .load()

Use when Kafka Streams isn’t flexible enough (e.g., ML pipelines, large windowed joins).


🧱 Kafka in Serverless & Microservices

Kafka fits naturally into event-driven microservice architectures.

Serverless use cases:

  • Triggering AWS Lambda with Kafka via MSK or EventBridge
  • Decoupling systems without direct REST calls

Benefits:

  • Resilience
  • Asynchronous flow
  • Observability via event logs

🧠 Kafka for ML Pipelines / Feature Stores

Kafka can serve as the backbone for real-time ML systems:

  • Ingest real-time data (e.g., user clicks, fraud signals)
  • Stream processing for feature extraction
  • Store to feature store or model input pipelines

Pattern:

Raw Events β†’ Kafka β†’ Feature Stream β†’ Feature Store / Model Inference

Works especially well with Flink or Kafka Streams + TensorFlow, PyTorch, etc.


πŸ—ƒοΈ Event-Driven Architecture Patterns

Kafka supports:

  • Event Notification: Lightweight events trigger actions.
  • Event-Carried State Transfer: Full state embedded in event (for denormalized systems).
  • Event Sourcing + CQRS: Reliable history of state changes and projections.

Tip: Choose based on your system's need for consistency, scalability, and auditability.


🧯 13. Most Common Kafka Issues and Solutions

This section covers frequent problems developers and operators encounter with Kafka, and offers actionable solutions and best practices to resolve them quickly.


🐒 1. High Consumer Lag

Symptoms:

  • Consumer is not keeping up with the producer
  • Growing lag visible in monitoring tools

Causes:

  • Slow processing logic
  • Not enough consumers in the group
  • Unbalanced partition assignment

Solutions:

  • Optimize message processing (e.g., use async/parallelism)
  • Scale consumer instances
  • Use max.poll.records and fetch.max.bytes wisely
  • Monitor and alert on lag metrics

πŸ” 2. Consumer Rebalance Storms

Symptoms:

  • Frequent rebalancing
  • Consumers constantly being kicked out and rejoining

Causes:

  • Long processing without poll()
  • Flaky consumer instances
  • Using old assignment strategies

Solutions:

  • Increase session.timeout.ms or use heartbeat.interval.ms properly
  • Use CooperativeStickyAssignor instead of range or roundrobin
  • Ensure poll() is called within allowed interval
  • Avoid killing consumers too aggressively (e.g., via autoscaling)

❗ 3. Messages Not Being Delivered

Symptoms:

  • Messages produced but not consumed
  • Empty topic on consumer side

Causes:

  • Consumer group offset not committed or stuck
  • Wrong topic or partition config
  • Topic retention expired the data

Solutions:

  • Check if consumer is part of the correct group
  • Reset offsets using kafka-consumer-groups.sh
  • Verify retention settings (retention.ms)
  • Confirm the producer is writing to the correct topic

πŸ›‘ 4. Under-Replicated Partitions

Symptoms:

  • Partitions missing followers in ISR (In-Sync Replica)
  • Broker disk I/O spikes

Causes:

  • Slow disk or network
  • Broker down or overloaded

Solutions:

  • Add more brokers or redistribute partitions
  • Check disk performance and broker logs
  • Set alerting on under_replicated_partitions

πŸ”’ 5. Authorization Failures / ACL Errors

Symptoms:

  • Clients get β€œnot authorized” errors
  • Producers or consumers cannot connect

Causes:

  • ACLs misconfigured or missing
  • SASL/SSL not correctly set up

Solutions:

  • Use kafka-acls.sh to review and add necessary permissions
  • Double-check client credentials and broker security config
  • Log and monitor kafka.authorizer.logger

🧼 6. Duplicate Messages

Symptoms:

  • Same message processed multiple times
  • Downstream system shows double entries

Causes:

  • Retries without idempotent producer
  • At-least-once delivery without deduplication

Solutions:

  • Enable idempotence: enable.idempotence=true
  • Use message keys for idempotency (e.g., order ID)
  • Implement deduplication logic in consumer

🧯 7. Disk Space Full

Symptoms:

  • Kafka broker crashes or stops accepting writes
  • OS alerts for low disk space

Causes:

  • Retention settings too generous
  • Compaction not cleaning up as expected

Solutions:

  • Tune retention.ms, retention.bytes, or log.segment.bytes
  • Check compaction and cleanup policies
  • Use log.dirs across multiple mount points

🐞 8. Offset Commit Failures

Symptoms:

  • Exceptions in consumer logs related to commit
  • Messages processed multiple times after restart

Causes:

  • Using auto-commit with long processing
  • Commit offset before processing (at-most-once)

Solutions:

  • Use manual commits after processing
  • Use commitSync() or commitAsync() with error handling
  • Log and monitor commit failures

πŸ§ͺ 9. Serialization / Deserialization Errors

Symptoms:

  • Exceptions during produce or consume
  • Invalid format or corrupted data

Causes:

  • Mismatched schema between producer and consumer
  • Schema evolution breaking compatibility
  • Corrupt messages

Solutions:

  • Use Schema Registry to enforce compatibility
  • Validate all changes to schema before deployment
  • Add error handling to your serializers/deserializers