Apache Kafka — Basics

What it is

Distributed, append-only log. High throughput, durable, replayable. Used for event streaming, message bus, log aggregation, CDC, stream processing.

Core concepts

Topic — named stream of records (~ table).
Partition — ordered, immutable log inside a topic. Unit of parallelism.
Offset — sequential id within a partition.
Broker — Kafka server. Cluster has many.
Producer — writes records.
Consumer — reads records.
Consumer group — set of consumers sharing a subscription; each partition is assigned to exactly one consumer in the group at a time, so consumers beyond the partition count sit idle.
Replica — partition copies on other brokers. Leader handles I/O; followers replicate.
ISR (in-sync replicas) — replicas caught up with leader.
Controller — broker role that manages cluster metadata and partition leader election. In KRaft mode a Raft quorum of controllers replaces ZooKeeper.

Producer essentials

Records have key, value, headers, timestamp.
Key determines partition (default: hash(key) % numPartitions, using murmur2). Same key → same partition → ordering guarantee for that key, as long as the partition count does not change.
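A minimal sketch of the key-to-partition rule above. Assumption: zlib.crc32 stands in for Kafka's murmur2 hash, and choose_partition is a hypothetical helper, so the partition numbers will not match a real cluster.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Toy stand-in for the default partitioner: hash(key) % numPartitions.

    Assumption: crc32 replaces murmur2, so results differ from real Kafka.
    """
    return zlib.crc32(key) % num_partitions


events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
for key, value in events:
    # Both user-1 events print the same partition number.
    print(key.decode(), value, "-> partition", choose_partition(key, 6))
```

Because the result depends on num_partitions, adding partitions to a topic can send an existing key to a different partition.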
Acks:
- acks=0 — fire and forget; lost on failure.
- acks=1 — leader ack only; lost if leader fails before replication.
- acks=all — wait for all ISR replicas; safest. Pair with min.insync.replicas=2 (and replication factor 3) so writes fail instead of landing on a single replica.
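The three acks levels can be modeled as a small decision function. This is a toy model under stated assumptions, not broker code: write_acknowledged is a hypothetical name, and isr_size counts the leader itself.

```python
def write_acknowledged(acks: str, isr_size: int, min_insync_replicas: int = 1) -> bool:
    """Toy model: does the producer see this write as successful?

    isr_size includes the leader; 0 means no leader is available.
    """
    if acks == "0":
        return True  # producer never waits for a broker response
    if isr_size < 1:
        return False  # no leader to accept the write
    if acks == "1":
        return True  # leader appended locally; followers may still lag
    if acks == "all":
        # The broker rejects the write when the ISR is smaller than the minimum.
        return isr_size >= min_insync_replicas
    raise ValueError(f"unknown acks setting: {acks}")


print(write_acknowledged("all", isr_size=3, min_insync_replicas=2))  # True
print(write_acknowledged("all", isr_size=1, min_insync_replicas=2))  # False
print(write_acknowledged("1", isr_size=1, min_insync_replicas=2))    # True
```

Note that "acknowledged" with acks=0 only means the producer did not wait; the record may still be lost.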
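Idempotent producer (enable.idempotence=true) — broker drops duplicate retries using a producer id and per-partition sequence numbers; holds only within one producer session. Default since 3.0.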
Batching (linger.ms, batch.size) trades latency for throughput.
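A rough sketch of how idempotent-producer deduplication can work on the broker side. Assumptions: PartitionLog is a hypothetical class, and the model is simplified to one sequence counter per producer id (real brokers also track producer epochs and work on batches).

```python
class PartitionLog:
    """Toy partition that drops retried records by (producer_id, sequence)."""

    def __init__(self) -> None:
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append unless this sequence was already written. True if appended."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate caused by a retry; ignore it
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = PartitionLog()
log.append(producer_id=7, seq=0, value="a")
log.append(producer_id=7, seq=1, value="b")
log.append(producer_id=7, seq=1, value="b")  # retry after a lost ack
print(log.records)  # ['a', 'b']
```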

Consumer essentials

Pulls (long-poll) from broker.
Tracks offsets per partition. Committed offsets stored in __consumer_offsets.
Auto commit vs manual commit:
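- Auto (enable.auto.commit=true): offsets are committed periodically inside poll(); simple, but gives at-most-once or at-least-once depending on whether processing finishes before the commit.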
- Manual: commit after processing → at-least-once.
Rebalance when group membership changes — partitions reassigned. Causes brief pause; tune session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms.
Sticky assignor / cooperative-sticky rebalancing reduces churn.
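To make rebalancing concrete, here is a naive round-robin assignment, recomputed when a consumer leaves. Assumption: assign is a hypothetical helper, not one of Kafka's built-in assignors.

```python
def assign(partitions, consumers):
    """Round-robin toy assignor: every partition goes to exactly one consumer."""
    if not consumers:
        return {}
    assignment = {c: [] for c in consumers}
    ordered = sorted(consumers)
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment


partitions = list(range(6))
before = assign(partitions, ["c1", "c2", "c3"])
after = assign(partitions, ["c1", "c2"])  # c3 left the group, so rebalance
print(before)  # {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
print(after)   # {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```

In this run partitions 3 and 4 change owner even though c1 and c2 never left. A sticky assignor aims to move only the partitions orphaned by c3 (2 and 5), which is the churn reduction mentioned above.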

Ordering

Per-partition ordering only: within a partition, consumers see records in offset order.
No global ordering across partitions.
For per-entity ordering: use entity id as message key.

Delivery semantics

At most once — commit before processing. Loss on crash.
At least once — commit after processing. Duplicates on crash. Default & most common.
Exactly once — idempotent producer + transactions (transactional.id on the producer, isolation.level=read_committed on the consumer). Works inside Kafka boundaries (consume → transform → produce). For external sinks, you still need idempotent consumers.
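The difference between at-most-once and at-least-once comes down to the order of commit and process. The simulation below is a toy under stated assumptions: run is a hypothetical function, the "log" is a Python list, and the consumer crashes once between its two steps, then restarts from the committed offset.

```python
def run(log, commit_first, crash_at):
    """Consume `log` with one crash at offset `crash_at`, then resume.

    Returns the values that were actually processed, in order.
    """
    processed = []
    committed = 0  # next offset to read after a restart
    crashed = False
    offset = 0
    while offset < len(log):
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed  # restart: this record is skipped
                continue
            processed.append(log[offset])
        else:
            processed.append(log[offset])
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed  # restart: this record is replayed
                continue
            committed = offset + 1
        offset += 1
    return processed


log = ["a", "b", "c"]
print(run(log, commit_first=True, crash_at=1))   # ['a', 'c']
print(run(log, commit_first=False, crash_at=1))  # ['a', 'b', 'b', 'c']
```

The second result is why at-least-once consumers are usually written to be idempotent.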

Storage

Each partition = sequence of segment files on disk.
Retention by time (retention.ms) or size (retention.bytes). Old segments deleted.
Compaction — alternative retention: keep latest value per key. Useful for state snapshots.
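Compaction can be pictured as a pass that keeps only the newest record for each key. A minimal sketch, assuming records are (key, value) tuples and compact is a hypothetical helper; real Kafka compacts in the background and leaves the active segment alone.

```python
def compact(log):
    """Keep only the latest record per key, preserving the order of survivors."""
    latest_index = {key: i for i, (key, _) in enumerate(log)}
    return [rec for i, rec in enumerate(log) if latest_index[rec[0]] == i]


log = [("u1", "a@x"), ("u2", "b@x"), ("u1", "a@y"), ("u3", "c@x"), ("u2", "b@y")]
print(compact(log))  # [('u1', 'a@y'), ('u3', 'c@x'), ('u2', 'b@y')]
```

Replaying the compacted log rebuilds the same final state per key as replaying the full log, which is what makes it usable as a snapshot.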
Kafka appends sequentially and leans on the OS page cache → throughput stays high even on spinning disks.

Replication

Each partition has one leader and N-1 followers, where N is the replication factor.
Followers fetch (pull) writes from the leader; a record counts as committed once every replica in the ISR has it.
Failover: broker dies → controller elects new leader from ISR.
Unclean leader election (unclean.leader.election.enable) — allows a non-ISR replica to become leader → data loss. Off by default; enable only when availability matters more than durability (AP).
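The failover rule can be sketched as a function over replica sets. Assumptions: elect_leader is a hypothetical helper, broker ids are plain integers, and the first eligible replica in assignment order wins; treat it as an illustration of clean vs unclean election rather than the controller's actual code.

```python
from typing import Optional


def elect_leader(replicas, isr, live, unclean=False) -> Optional[int]:
    """Pick a new leader broker id, or None if the partition goes offline.

    replicas: assigned replica broker ids in preference order.
    isr: in-sync replica ids. live: broker ids currently up.
    """
    for broker in replicas:
        if broker in isr and broker in live:
            return broker  # clean election: no committed data is lost
    if unclean:
        for broker in replicas:
            if broker in live:
                return broker  # out-of-sync replica: records may be lost
    return None


# Broker 1 (the old leader) has died.
print(elect_leader([1, 2, 3], isr={1, 2}, live={2, 3}))             # 2
print(elect_leader([1, 2, 3], isr={1}, live={2, 3}))                # None
print(elect_leader([1, 2, 3], isr={1}, live={2, 3}, unclean=True))  # 2
```

The second call shows the default trade-off: with no live ISR member the partition stays unavailable rather than risk losing data.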

Common features

Schema Registry (Confluent / Apicurio) — Avro / Protobuf / JSON Schema with compatibility checks.
Kafka Connect — pluggable source/sink connectors (Debezium for CDC, JDBC, S3, Elasticsearch).
Kafka Streams (Java library) / ksqlDB (SQL engine built on Kafka Streams) — stream processing.
MirrorMaker 2 / Cluster Linking (Confluent) — cross-cluster replication.

CLI essentials

kafka-topics --bootstrap-server X --create --topic events --partitions 6 --replication-factor 3
kafka-topics --list --bootstrap-server X
kafka-topics --describe --topic events --bootstrap-server X
kafka-console-producer --topic events --bootstrap-server X
kafka-console-consumer --topic events --from-beginning --bootstrap-server X
kafka-consumer-groups --bootstrap-server X --describe --group my-group
kafka-consumer-groups --bootstrap-server X --reset-offsets --to-earliest --group g --topic t --execute
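The --describe output for a consumer group reports lag per partition, which is the log end offset minus the group's committed offset. A small sketch of that arithmetic; consumer_lag and the sample offsets are hypothetical, and treating a missing commit as offset 0 is an assumption made for this example only.

```python
def consumer_lag(log_end, committed):
    """Per-partition lag: log end offset minus the committed offset.

    Assumption: a partition with no committed offset counts as committed=0.
    """
    return {p: end - committed.get(p, 0) for p, end in sorted(log_end.items())}


log_end = {0: 120, 1: 95, 2: 40}   # hypothetical log end offsets
committed = {0: 120, 1: 80}        # hypothetical committed offsets
print(consumer_lag(log_end, committed))  # {0: 0, 1: 15, 2: 40}
```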

Alternatives & ecosystem

Redpanda — Kafka API compatible, written in C++, no JVM; positions itself as lower-latency.
Apache Pulsar — segment-based, separate broker/storage.
AWS MSK / Confluent Cloud / Aiven — managed.
NATS JetStream — lighter-weight alternative.
AWS Kinesis / GCP Pub/Sub — cloud-native streams.