Kafka — Basics
Apache Kafka — Basics
What it is
Distributed, append-only log: high-throughput, durable, replayable. Used for event streaming, as a message bus, and for log aggregation, CDC (change data capture), and stream processing.
Core concepts
- Topic — named stream of records (~ table).
- Partition — ordered, immutable log inside a topic. Unit of parallelism.
- Offset — sequential id within a partition.
- Broker — Kafka server. Cluster has many.
- Producer — writes records.
- Consumer — reads records.
- Consumer group — set of consumers cooperating; each partition is read by exactly one consumer in the group.
- Replica — partition copies on other brokers. Leader handles writes and (by default) reads; followers replicate.
- ISR (in-sync replicas) — replicas caught up with leader.
- Controller — node that manages cluster metadata and partition leader election (a Raft quorum under KRaft, which replaces ZooKeeper).
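A toy model can make the topic / partition / offset relationship concrete. The sketch below is plain Python, not a Kafka client; `ToyTopic` and its methods are hypothetical names invented for illustration.

```python
class ToyTopic:
    """In-memory stand-in for a topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Append a record and return its offset within that partition."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1

    def read(self, partition, offset):
        """Return every record at or after `offset`; reading never removes data."""
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=2)
first = topic.append(0, "a")   # offset 0 in partition 0
second = topic.append(0, "b")  # offset 1 in partition 0
other = topic.append(1, "c")   # offset 0 in partition 1
```

Offsets start at 0 independently in each partition, and a read leaves the log untouched, which is what makes replay possible.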
Producer essentials
- Records have key, value, headers, timestamp.
- Key determines partition (default: `hash(key) % numPartitions`; the Java client hashes the serialized key with murmur2). Same key → same partition → ordering guarantee for that key, as long as the partition count does not change.
- Acks:
  - `acks=0` — fire and forget; lost on failure.
  - `acks=1` — leader ack only; lost if leader fails before replication.
  - `acks=all` — wait for ISR; safest. Pair with `min.insync.replicas=2`.
- Idempotent producer (`enable.idempotence=true`) — dedupes retried sends within a producer session. Default in 3.0+.
- Batching (`linger.ms`, `batch.size`) trades latency for throughput.
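The key-to-partition rule above can be sketched in a few lines. This is a toy partitioner, not the client's implementation: `partition_for` is a hypothetical helper, and `zlib.crc32` stands in for the murmur2 hash used by the Java client.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Stable hash of the key bytes, modulo the partition count (crc32 as a stand-in)."""
    return zlib.crc32(key) % num_partitions


keys = [b"user-1", b"user-2", b"user-3", b"user-1"]
placements = [partition_for(k, 6) for k in keys]

# Changing the partition count changes the modulus, so a key may land elsewhere.
remapped = [k for k in set(keys) if partition_for(k, 6) != partition_for(k, 12)]
```

Both `user-1` records land in the same partition, so their relative order is preserved. Any key listed in `remapped` would start going to a different partition after the topic grows from 6 to 12 partitions, which is why adding partitions to a keyed topic can break per-key ordering.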
Consumer essentials
- Pulls (long-poll) from broker.
- Tracks offsets per partition. Committed offsets stored in `__consumer_offsets`.
- Auto commit vs manual commit:
  - Auto: simple, but offsets are committed on a timer, so a crash gives at-most-once or at-least-once behavior depending on whether processing finished before the commit.
  - Manual: commit after processing → at-least-once.
- Rebalance when group membership changes — partitions reassigned. Causes brief pause; tune `session.timeout.ms`, `heartbeat.interval.ms`, `max.poll.interval.ms`.
- Sticky assignor / cooperative-sticky rebalancing reduces churn.
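To illustrate what a rebalance does, here is a minimal round-robin assignment sketch. It is a simplified model written for these notes (the `assign` function is hypothetical), not the logic of any built-in Kafka assignor.

```python
def assign(partitions, consumers):
    """Toy round-robin assignment: each partition goes to exactly one consumer."""
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment


before = assign(range(6), ["c1", "c2", "c3"])
after = assign(range(6), ["c1", "c2"])  # c3 left the group, so partitions are reassigned
```

In this run, partitions 2 and 5 must move because their owner left, but partitions 3 and 4 also change hands even though their owners are still alive. That avoidable movement is the churn that sticky assignment strategies aim to reduce.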
Ordering
- Ordering is guaranteed only within a partition, where it is strict.
- No global ordering across partitions.
- For per-entity ordering: use entity id as message key.
Delivery semantics
- At most once — commit before processing. Loss on crash.
- At least once — commit after processing. Duplicates on crash. Default & most common.
- Exactly once — idempotent producer + transactional reads/writes within Kafka. Works inside Kafka boundaries (consume → transform → produce). For external sinks, you still need idempotent consumers.
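The difference between the first two semantics comes down to the order of "commit" and "process". The simulation below is a sketch under simplifying assumptions (one partition, one consumer, a single crash between the two steps); `consume` and `run_with_restart` are hypothetical helpers, not client API.

```python
def consume(log, committed, commit_first, crash_at=None):
    """Process records from `committed` onward; return (processed, new_committed).

    commit_first=True  -> commit before processing (at-most-once)
    commit_first=False -> commit after processing  (at-least-once)
    crash_at is the offset at which the consumer dies between the two steps.
    """
    processed = []
    for offset in range(committed, len(log)):
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                return processed, committed  # committed but never processed
            processed.append(log[offset])
        else:
            processed.append(log[offset])
            if offset == crash_at:
                return processed, committed  # processed but never committed
            committed = offset + 1
    return processed, committed


log = ["a", "b", "c"]


def run_with_restart(commit_first):
    """Crash once at offset 1, then restart from the committed offset."""
    seen, committed = consume(log, 0, commit_first, crash_at=1)
    more, _ = consume(log, committed, commit_first)
    return seen + more


at_most_once = run_with_restart(commit_first=True)    # ["a", "c"]: "b" is lost
at_least_once = run_with_restart(commit_first=False)  # ["a", "b", "b", "c"]: "b" is repeated
```

The duplicate in the at-least-once run is why downstream processing usually needs to be idempotent.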
Storage
- Each partition = sequence of segment files on disk.
- Retention by time (`retention.ms`) or size (`retention.bytes`). Whole old segments are deleted once they exceed the limit.
- Compaction (`cleanup.policy=compact`) — alternative retention: keep latest value per key. Useful for state snapshots.
- Kafka writes sequentially → very fast on spinning disks too.
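Compaction is easiest to see on a small example. The function below is a simplified sketch (the `compact` helper is hypothetical): it ignores tombstones, segment boundaries, and the fact that the active segment is not compacted.

```python
def compact(log):
    """Keep only the latest record per key; surviving records keep their original offsets.

    `log` is a list of (key, value) pairs, and the list index is the offset.
    """
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset  # later offsets overwrite earlier ones
    return [(off, *log[off]) for off in sorted(latest.values())]


log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k3", "v1"), ("k2", "v2")]
compacted = compact(log)
# [(2, "k1", "v2"), (3, "k3", "v1"), (4, "k2", "v2")]
```

Replaying the compacted log rebuilds the same final state per key as replaying the full log, with fewer records to read.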
Replication
- Each partition has one leader and N-1 followers (N = replication factor).
- Followers fetch writes from the leader; only in-sync replicas (ISR) count toward `acks=all`.
- Failover: broker dies → controller elects new leader from ISR.
- Unclean leader election (`unclean.leader.election.enable`) — allow a non-ISR replica to become leader → possible data loss. Off by default; turn on only to favor availability over consistency (AP).
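The interaction between `acks=all`, `min.insync.replicas`, and leader election can be modeled roughly as follows. This is a toy model with hypothetical function names; real brokers track the ISR dynamically and the controller's election logic has more cases.

```python
def accept_write(acks, isr, min_insync_replicas):
    """Toy broker-side check: acks=all writes need enough in-sync replicas."""
    if acks == "all" and len(isr) < min_insync_replicas:
        return False  # a real broker rejects this with a NotEnoughReplicas error
    return True


def elect_leader(replicas, isr, alive, unclean=False):
    """Pick the first live in-sync replica; optionally fall back to any live replica."""
    for r in replicas:
        if r in isr and r in alive:
            return r
    if unclean:
        for r in replicas:
            if r in alive:
                return r  # may be missing acknowledged writes
    return None  # no eligible leader: the partition stays offline


clean = elect_leader([1, 2, 3], isr={1, 2}, alive={2, 3})
offline = elect_leader([1, 2, 3], isr={1}, alive={3})
risky = elect_leader([1, 2, 3], isr={1}, alive={3}, unclean=True)
```

Here `clean` is broker 2, `offline` is `None`, and `risky` is broker 3, which was never in sync. That last case is the availability-for-durability trade made by unclean election.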
Common features
- Schema Registry (Confluent / Apicurio) — Avro / Protobuf / JSON Schema with compatibility checks.
- Kafka Connect — pluggable source/sink connectors (Debezium for CDC, JDBC, S3, Elasticsearch).
- Kafka Streams (Java library) / ksqlDB (SQL engine built on Kafka Streams) — stream processing.
- MirrorMaker / Cluster Linking (Confluent) — cross-cluster replication.
CLI essentials

    kafka-topics --bootstrap-server X --create --topic events --partitions 6 --replication-factor 3
    kafka-topics --list --bootstrap-server X
    kafka-topics --describe --topic events --bootstrap-server X
    kafka-console-producer --topic events --bootstrap-server X
    kafka-console-consumer --topic events --from-beginning --bootstrap-server X
    kafka-consumer-groups --bootstrap-server X --describe --group my-group
    kafka-consumer-groups --bootstrap-server X --reset-offsets --to-earliest --group g --topic t --execute

Alternatives & ecosystem
- Redpanda — Kafka API compatible, written in C++, no JVM; marketed as lower-latency.
- Apache Pulsar — segment-based, separate broker/storage.
- AWS MSK / Confluent Cloud / Aiven — managed.
- NATS JetStream — lighter-weight alternative.
- AWS Kinesis / GCP Pub/Sub — cloud-native streams.