Message Queues — Theory (interview deep-dive)

Delivery semantics

At most once: fire and forget. Messages may be lost.
At least once: redelivered until acknowledged, so consumers may see duplicates. This is the default for most queues.
Exactly once: end-to-end, this is only achievable as exactly-once processing, using idempotent consumers plus dedupe IDs. Brokers can offer “exactly-once delivery” within their own boundaries (Kafka transactions, the SQS FIFO 5-minute deduplication window), but external side effects still need consumer-side dedupe.

Ordering

Global order is rarely possible at scale.
Per-partition / per-queue / per-message-group order is achievable.
Choose the unit of ordering = entity that needs strict order (e.g., userId for user events, orderId for order state).

No system gives you both strict global order and maximum throughput: strict order serializes processing, so you trade one for the other.
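A minimal sketch of keyed ordering, assuming a stable hash of the ordering key picks the partition. The names `partition_for` and `route` are hypothetical, and this is not the partitioner any specific broker uses; it only illustrates why events for one key stay in order while different keys spread across partitions.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map an ordering key (e.g. a userId) to a partition with a stable hash."""
    # Python's built-in hash() is salted per process, so use a stable digest.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def route(events, num_partitions):
    """Append each (key, payload) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions

# Illustrative data: two users whose events interleave in arrival order.
events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "paid"),
    ("user-2", "cancelled"),
    ("user-1", "shipped"),
]
partitions = route(events, num_partitions=4)
```

Every event for `user-1` lands in the same partition in arrival order, so a single consumer of that partition sees them in sequence. Order across different keys is not guaranteed.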

Backpressure & flow control

Producer faster than consumer = queue grows.
Strategies:
- Bounded queue + drop / block producer.
- Auto-scale consumers up to partition count.
- Slow producer via API rate limit.
- Shed load (return 503 to upstream).

A growing lag without action is a path to outage.
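The bounded-queue strategy can be sketched in-process with the standard library, assuming a hypothetical `offer` helper that either drops immediately or blocks the producer for a short time. A `False` return is the point where an upstream API would shed load (for example by returning 503).

```python
import queue

def offer(q, item, mode, timeout=0.1):
    """Try to enqueue an item; return True if accepted, False if rejected."""
    try:
        if mode == "drop":
            q.put_nowait(item)            # reject immediately when full
        else:
            q.put(item, timeout=timeout)  # block the producer, then give up
        return True
    except queue.Full:
        return False

# A buffer of 3 with no consumer running: the 4th and 5th offers are rejected.
buffer = queue.Queue(maxsize=3)
results = [offer(buffer, i, mode="drop") for i in range(5)]
```

Blocking pushes the slowdown back to the producer, dropping protects latency at the cost of lost work. Which one fits depends on whether the producer can afford to wait.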

Poison messages & DLQ

A poison message is one that fails every time it is processed. Without a DLQ, it either blocks the queue forever or burns CPU on endless retries.

Pattern:

Try N times (with backoff).
After N, route to DLQ topic/queue.
Alert on DLQ depth > 0.
Manual inspection → fix → replay.

In SQS: RedrivePolicy { maxReceiveCount: 5, deadLetterTargetArn }. In RabbitMQ: dead-letter exchange + queue. In Kafka: write to *.dlq topic from consumer code.
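For brokers where the consumer owns the retry logic, the pattern above can be sketched as follows. This is a toy, in-memory version: `consume`, `handler`, and the list standing in for the DLQ are hypothetical, and the `sleep` parameter is injectable only so the backoff can be observed without waiting.

```python
import time

def consume(message, handler, dlq, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run handler with retries; return True on success, False if dead-lettered."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: park the message with context for later replay.
                dlq.append({"message": message, "attempts": attempt, "error": repr(exc)})
                return False
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))

def handler(msg):
    """Illustrative handler that can never process the 'poison' message."""
    if msg == "poison":
        raise ValueError("cannot parse")

dlq = []
delays = []  # records requested sleeps instead of actually sleeping
ok = consume("hello", handler, dlq, sleep=delays.append)
bad = consume("poison", handler, dlq, sleep=delays.append)
```

The poison message is attempted three times, waits 0.5 s and 1.0 s between attempts, and then lands in `dlq` with its error, while the healthy message is unaffected.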

Idempotency in consumers

Network retries plus at-least-once delivery mean you will see duplicates, so processing must be safe to repeat:

Dedupe by messageId (inbox table).
Conditional updates (UPDATE WHERE state='pending').
Use natural keys where possible (INSERT ... ON CONFLICT DO NOTHING).
Side effects (HTTP, email): send an idempotency key that the downstream can dedupe on within a time window, or accept rare duplicates.
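A minimal sketch of the inbox-table approach, using SQLite so it runs anywhere. The table names and `handle_deposit` are hypothetical. `INSERT OR IGNORE` is SQLite's spelling of the same idea as `INSERT ... ON CONFLICT DO NOTHING`. The key point is that the dedupe record and the business update commit in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inbox (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 0)")
conn.commit()

def handle_deposit(conn, message_id, account_id, amount):
    """Apply a deposit at most once per message_id; return True if applied."""
    with conn:  # one transaction: commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (message_id) VALUES (?)", (message_id,)
        )
        if cur.rowcount == 0:
            return False  # already seen: skip the side effect
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )
        return True

# The broker redelivers message m-1; only the first delivery changes the balance.
first = handle_deposit(conn, "m-1", "acct-1", 100)
second = handle_deposit(conn, "m-1", "acct-1", 100)
balance = conn.execute("SELECT balance FROM accounts WHERE id = 'acct-1'").fetchone()[0]
```

If the update fails, the inbox row rolls back with it, so a later redelivery is processed normally.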

Outbox pattern (revisited for queues)

Solves dual-write between DB + queue:

In same DB transaction, write business state + insert into outbox table.
A relay process publishes outbox rows, then marks them published. A crash between the two steps causes a re-publish, so delivery is at-least-once.
Consumer dedupes by message id (inbox).

Without an outbox, you lose events when the DB commits but the publish fails, or emit phantom events when the publish succeeds but the DB transaction rolls back.
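The write side and the relay can be sketched with SQLite, under the assumption that `publish` is whatever function hands a message to the broker. All table and function names here are hypothetical, and a production relay would also need batching, locking between relay instances, and cleanup of published rows.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)
conn.commit()

def place_order(conn, order_id):
    """Write business state and the outgoing event in one transaction."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        payload = json.dumps({"type": "OrderPlaced", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def relay_once(conn, publish):
    """Publish pending outbox rows in order; return how many were pending."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(row_id, payload)  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)

published = []  # stands in for the broker
place_order(conn, "o-1")
sent = relay_once(conn, lambda row_id, payload: published.append((row_id, json.loads(payload))))
```

The outbox row id doubles as the message id, which gives the consumer something stable to dedupe on when the relay re-publishes after a crash.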

RabbitMQ deep notes

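A classic queue lives on one node. Classic mirrored queues added replicas but are deprecated (removed in RabbitMQ 4.0); quorum queues replicate via Raft.
Connections are heavyweight (one TCP connection each); keep one long-lived connection per process and multiplex lightweight channels over it.
Prefetch (basic.qos) limits unacknowledged in-flight messages per consumer; set it to a small number (10-50) for fairness.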
tx.select is slow; use publisher confirms instead.
Persistent message + durable queue + publisher confirm + manual ack = strong durability.

SQS deep notes

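Visibility timeout is critical: the consumer must process and delete the message within the window, or it becomes visible again and is redelivered. Set it longer than worst-case processing time, or extend it mid-flight with ChangeMessageVisibility.
Long polling vs short polling: prefer long polling (WaitTimeSeconds up to the 20 s maximum). It cuts empty receives, which lowers cost, and returns as soon as a message arrives.
Standard queues may redeliver and reorder messages; design for both.
FIFO queues deliver each message group serially and have throughput quotas (by default 300 API calls per second per action, or 3,000 messages per second with batches of 10; more in high-throughput mode), so choose MessageGroupId values fine-grained enough to spread load.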
Cost is per request — batching helps.
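To make the visibility-timeout behaviour concrete, here is a toy in-memory model. It is not the SQS API: `ToyQueue` and its methods are hypothetical, deletion uses the message id rather than a receipt handle, and the clock is injected so the example runs instantly.

```python
import itertools

class ToyQueue:
    """In-memory model of receive/delete with a visibility timeout."""

    def __init__(self, visibility_timeout, clock):
        self.visibility_timeout = visibility_timeout
        self.clock = clock      # callable returning the current time in seconds
        self.messages = {}      # id -> [body, visible_at, receive_count]
        self._ids = itertools.count(1)

    def send(self, body):
        msg_id = next(self._ids)
        self.messages[msg_id] = [body, 0.0, 0]
        return msg_id

    def receive(self):
        """Return (id, body, receive_count) for one visible message, or None."""
        now = self.clock()
        for msg_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # hide while in flight
                entry[2] += 1
                return msg_id, entry[0], entry[2]
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

now = [0.0]  # fake clock, advanced by hand
q = ToyQueue(visibility_timeout=30, clock=lambda: now[0])
q.send("resize-image")
first = q.receive()    # consumer A takes the message
hidden = q.receive()   # consumer B sees nothing while it is in flight
now[0] = 31.0          # A was too slow: the timeout has expired
second = q.receive()   # the same message is delivered again
q.delete(second[0])
```

The second delivery carries a receive count of 2, which is the signal a redrive policy uses to decide when a message should move to the DLQ.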

Redis Streams notes

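XADD with MAXLEN ~ N caps memory (approximate trimming).
Consumer group + XREADGROUP with the ID > reads only messages never delivered to another consumer in the group.
XCLAIM (or XAUTOCLAIM on Redis 6.2+) after min-idle-time reassigns stuck pending messages.
Enable AOF for durability; with RDB snapshots only, a crash loses everything written since the last snapshot.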

Common interview Qs

Message lag is growing: how do you debug? Check for too few consumers, slow processing, a slow downstream dependency, or a hot partition.
Two consumers got the same message — why and how to handle? Visibility timeout expired (SQS) or redelivery on broker restart. Idempotent consumer.
Need strict order for user X’s events. Use partition / message group keyed by user id.
A broker is down — what happens? Producer either errors immediately or buffers. Decide based on SLA. Many brokers have replicated/HA setups.
How would you migrate from RabbitMQ to Kafka? Run both, dual publish from producers, switch consumers, eventually drop old.
When to NOT use a queue? Synchronous user-facing op; tiny event volumes (DB row + cron is enough); strict global ordering at huge scale.
Difference between fanout and pub/sub? Same idea (broadcast). Fanout exchange in RMQ; multiple SQS queues subscribed to one SNS topic.
What’s the difference between a topic in Kafka vs RabbitMQ? Kafka topic = partitioned, retained log that consumers read (and replay) by offset. RabbitMQ topic = exchange type that routes messages by routing-key pattern.
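DLQ pattern: where does the retry counter live? Usually in a message header or attribute. Some brokers maintain it for you (SQS ApproximateReceiveCount, RabbitMQ x-death, or x-delivery-count on quorum queues); in Kafka the consumer code has to track it itself.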
Why might you avoid Kafka exactly-once for cross-system writes? EOS only inside Kafka — DB writes still need idempotency.

Choosing — a quick decision tree

Already on AWS and a standard at-least-once queue is enough → SQS.
Need replay + analytics + huge volume → Kafka.
Need rich routing patterns and per-message ack → RabbitMQ.
Lightweight job queue, already on Redis → Redis Streams / BullMQ.
Edge / IoT / mixed messaging → NATS JetStream.
Multi-tenant geo-replicated → Pulsar.

Sizing rules of thumb

Number of consumers ≤ number of partitions (or queue shards); extra consumers sit idle.
Visibility timeout = 2-3× p99 processing time.
Prefetch / poll batch size depends on per-msg processing time × parallelism.
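Backlog growth rate × retention time = max disk needed. Worst case, consumers are stopped and the growth rate equals the full ingress rate; multiply by the replication factor for replicated logs.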
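A worked example of these rules of thumb. The helper names and all input numbers are illustrative assumptions. The disk estimate takes the worst case where the backlog grows at the full ingress rate, and it adds a replication factor, which the rule above does not mention.

```python
import math

def visibility_timeout_s(p99_processing_s, factor=3):
    """Visibility timeout as a multiple (2-3x) of p99 processing time."""
    return math.ceil(p99_processing_s * factor)

def retained_log_bytes(ingress_bytes_per_s, retention_s, replication_factor=1):
    """Worst-case disk: everything written during the retention window."""
    return ingress_bytes_per_s * retention_s * replication_factor

def max_useful_consumers(partitions, desired):
    """Consumers beyond the partition count would sit idle."""
    return min(partitions, desired)

# Assumed workload: p99 of 4 s, 1 MB/s ingress, 7-day retention, 3 replicas.
timeout = visibility_timeout_s(4.0)                               # 12 seconds
disk = retained_log_bytes(1_000_000, 7 * 24 * 3600, 3)            # 1_814_400_000_000 bytes, about 1.8 TB
consumers = max_useful_consumers(partitions=12, desired=20)       # 12
```

These are starting points for capacity planning, to be checked against measured throughput and the quotas of the broker in use.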

Anti-patterns

Using a queue as DB.
Per-message DB connection (use pool).
No DLQ.
Letting one slow message block all others (unbounded retries).
Mixing message types in one queue without versioning.
Synchronous request-reply over a queue when HTTP would do.
Skipping idempotency assuming “exactly once” delivery.