Messaging, Event Pipelines, and Delivery Semantics

What's being tested

Interviewers are probing whether you can design reliable distributed workflows where messages, events, files, tasks, or log records move through multiple components without being lost, duplicated incorrectly, or reordered in harmful ways. Strong answers show command of delivery semantics, idempotency, ordering, backpressure, retries, checkpointing, and failure recovery, not just “put a queue in the middle.” Google cares because many production systems are asynchronous: user-facing products, storage systems, migrations, notifications, and batch/stream processing all depend on reasoning precisely about what happens when machines crash, networks partition, or downstream services slow down.

Core knowledge

Delivery semantics describe what the system promises under failure: at-most-once may lose messages, at-least-once may duplicate them, and effectively-once usually means at-least-once delivery plus idempotent side effects. True exactly-once across arbitrary services is rare and expensive.
Idempotency is the default tool for surviving retries. Use stable identifiers such as message_id, task_id, file_chunk_id, or Stripe-style idempotency keys so repeated requests produce the same durable result rather than duplicate sends, writes, or state transitions.
Ordering is usually scoped, not global. Systems like Kafka preserve order within a partition, so choose the partition key carefully: student_id, guardian_id, migration_key, or file_id. Global ordering across millions of events usually limits throughput and availability.
Change data capture uses database logs, such as MySQL binlog or Postgres WAL, to stream committed mutations after an initial snapshot. A safe migration pairs a consistent snapshot with CDC from a known log offset, then applies changes idempotently in commit order per key.
Checkpointing records durable progress, often as offsets, sequence numbers, or high-water marks. A worker should commit progress only after downstream side effects are safely persisted; if it commits first, a crash between the commit and the side effect silently loses the record. Committing late means a crash replays work and creates duplicates, but idempotent sinks can absorb those, so it is usually the safer choice.
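Backpressure prevents overload from cascading. Use bounded queues, worker concurrency limits, rate limiters, and consumer lag alerts. Throughput planning starts with $\text{required workers} \approx \frac{\lambda \times \text{avg service time}}{\text{target utilization}}$ where $\lambda$ is the arrival rate in messages per second and service time is in seconds. For example, 2,000 messages per second at 50 ms each with a 70% utilization target gives $\frac{2000 \times 0.05}{0.7} \approx 143$ workers.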
Retries need bounded policy, not infinite loops. Use exponential backoff with jitter, classify errors as transient versus permanent, and route poison messages to a dead-letter queue after N attempts. Retrying a malformed event forever can block a partition.
Acknowledgments should reflect the business state, not just transport success. For school-to-guardian messaging, “accepted by SMS provider,” “delivered to device,” and “guardian acknowledged” are different events with different SLAs, audit requirements, and retry behavior.
Atomicity boundaries determine whether semantics are believable. If a worker reads from Kafka, writes to Bigtable, and calls an external SMS API, there is no single transaction across all three. Design for compensation, dedupe tables, and observable state machines.
State machines make asynchronous workflows debuggable. Model states like PENDING, SENT, DELIVERED, ACKED, FAILED, or CANCELLED, and enforce legal transitions with version checks or compare-and-swap. This avoids ambiguous “boolean flags” under retries.
Partitioning and hot keys dominate scalability. A single high-fanout school broadcast, large file, or popular tenant can overload one shard if partitioned poorly. Mitigations include salting, hierarchical fanout, batching, or isolating heavy tenants into separate queues.
Observability must expose correctness and latency: consumer lag, retry rate, duplicate suppression count, p50/p95/p99 end-to-end latency, DLQ depth, per-partition skew, acknowledgment rate, and state-transition errors. Logs should include correlation IDs across producers, queues, workers, and sinks.
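
The retry guidance above can be sketched in a few lines. This is a minimal illustration, not a production client: TransientError, PermanentError, RetryingConsumer, and the default limits are hypothetical names and values chosen for the example, and the dead-letter queue is just an in-memory list.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


class TransientError(Exception):
    """Hypothetical: a failure worth retrying, such as a timeout."""


class PermanentError(Exception):
    """Hypothetical: a failure retries cannot fix, such as a malformed event."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


@dataclass
class RetryingConsumer:
    handler: Callable[[dict], None]
    max_attempts: int = 5
    # No-op by default so the sketch runs instantly; pass time.sleep in real use.
    sleep: Callable[[float], None] = lambda seconds: None
    dead_letters: list = field(default_factory=list)

    def process(self, message: dict) -> bool:
        """Return True on success, False if the message was dead-lettered."""
        for attempt in range(self.max_attempts):
            try:
                self.handler(message)
                return True
            except PermanentError as exc:
                # Retrying cannot help, so park the message immediately.
                self.dead_letters.append((message, f"permanent: {exc}"))
                return False
            except TransientError:
                if attempt < self.max_attempts - 1:
                    self.sleep(backoff_delay(attempt))
        # Bounded policy: give up after max_attempts instead of blocking the partition.
        self.dead_letters.append((message, "retries exhausted"))
        return False
```

In an interview, the point to draw out is that the error classification and the attempt bound are what keep one bad record from stalling everything behind it; the exact backoff constants matter much less.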

Worked example

For the prompt “Design school-to-guardian messaging with acknowledgments,” start by clarifying scope: “Are messages one-way alerts, or do guardians need explicit acknowledgments? Which channels are required: push, SMS, email? What is the expected scale: schools, guardians, messages per minute, and peak emergency broadcast behavior?” Then declare assumptions: messages are generated by authorized school staff, each message targets one or more guardians, and the system must track both delivery attempts and guardian acknowledgments.

A strong answer can be organized around four pillars: data model, delivery pipeline, acknowledgment workflow, and reliability/observability. The data model includes Message, Recipient, DeliveryAttempt, and Acknowledgment, with stable IDs to dedupe retries. The delivery pipeline uses a durable queue such as Pub/Sub or Kafka, workers partitioned by recipient or message, and channel adapters for SMS/email/push with provider-specific callbacks normalized into internal events.

The acknowledgment workflow is a state machine: CREATED -> QUEUED -> SENT -> DELIVERED -> ACKED or FAILED, with timestamps and actor identity for auditability. Retries are safe because workers use idempotency keys when calling channel providers and update state with conditional writes. One explicit tradeoff to flag: if an emergency broadcast targets 100,000 guardians, strict per-school ordering may slow fanout; you might preserve ordering per guardian while allowing broad parallelism across recipients. Close by saying that with more time you would cover privacy controls, abuse prevention, regional data residency, and operational runbooks for provider outages.
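
One way to make the conditional-write idea concrete is a versioned row per (message, guardian) pair. The sketch below is an assumption-laden illustration: DeliveryStore is a hypothetical in-memory stand-in for a database with compare-and-swap, and the choice of which states may move to FAILED is made up for the example, since the text only gives the main chain.

```python
from dataclasses import dataclass

# The CREATED..ACKED chain comes from the text; which states may move to
# FAILED is an assumption made for this sketch.
LEGAL_TRANSITIONS = {
    "CREATED": {"QUEUED", "FAILED"},
    "QUEUED": {"SENT", "FAILED"},
    "SENT": {"DELIVERED", "FAILED"},
    "DELIVERED": {"ACKED"},
    "ACKED": set(),
    "FAILED": set(),
}


@dataclass
class DeliveryRow:
    state: str = "CREATED"
    version: int = 0


class DeliveryStore:
    """Hypothetical in-memory stand-in for a store with conditional writes."""

    def __init__(self) -> None:
        self._rows = {}

    def create(self, message_id: str, guardian_id: str) -> None:
        # setdefault keeps creation idempotent if the producer retries.
        self._rows.setdefault((message_id, guardian_id), DeliveryRow())

    def read(self, message_id: str, guardian_id: str) -> tuple:
        row = self._rows[(message_id, guardian_id)]
        return row.state, row.version

    def transition(self, message_id: str, guardian_id: str,
                   expected_version: int, new_state: str) -> bool:
        """Apply the transition only if the version matches and the move is legal."""
        row = self._rows[(message_id, guardian_id)]
        if row.version != expected_version:
            return False  # another worker or a retry already advanced this row
        if new_state not in LEGAL_TRANSITIONS[row.state]:
            return False  # e.g. a late DELIVERED callback after ACKED
        row.state = new_state
        row.version += 1
        return True
```

A retried worker that still holds a stale version gets False back and can re-read the row instead of overwriting newer state, which is the behavior a real store would provide through a conditional update on the version column.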

A second angle

For the prompt “Design distributed log storage service,” the same ideas appear at a lower infrastructure layer. Instead of guardian acknowledgments, the core unit is an append record with an offset; correctness depends on durable replication, leader election, retention, and ordered reads within partitions. Delivery semantics become producer acknowledgments such as acks=1 versus acks=all, consumer offsets, and replay guarantees. The design pressure shifts from user workflow clarity to throughput, disk layout, segment compaction, and recovery after broker failure, but the same vocabulary (partitioning, idempotent producers, checkpoints, lag, and backpressure) still drives the answer.
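
To ground the offset and replay vocabulary, here is a toy single-partition log. It is only a sketch: LogPartition and Consumer are hypothetical names, everything lives in memory, and replication, retention, and leader election are deliberately left out.

```python
class LogPartition:
    """Append-only sequence of records; the list index serves as the offset."""

    def __init__(self) -> None:
        self._records = []

    def append(self, payload: bytes) -> int:
        """Store a record and return the offset it was assigned."""
        self._records.append(payload)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return up to max_records (offset, payload) pairs starting at offset."""
        end = min(len(self._records), offset + max_records)
        return [(i, self._records[i]) for i in range(offset, end)]


class Consumer:
    """Tracks a committed offset, i.e. the next record to read after a restart."""

    def __init__(self, partition: LogPartition) -> None:
        self.partition = partition
        self.committed = 0

    def poll(self, max_records: int = 100) -> list:
        # Always reads from the committed position, so uncommitted work replays.
        return self.partition.read(self.committed, max_records)

    def commit(self, offset: int) -> None:
        # Committing offset N means records up to and including N are done.
        self.committed = max(self.committed, offset + 1)
```

If a consumer polls two records but commits only the first before crashing, the next poll starts at the second record again. That replay is the at-least-once behavior the rest of the design has to tolerate.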

Common pitfalls

Pitfall: Claiming “exactly-once” without defining the boundary.

A tempting but weak answer is “use Kafka exactly-once semantics, so there are no duplicates.” Better: explain whether exactly-once applies only between Kafka topics, or whether it also covers database writes, file writes, notifications, and external APIs. Interviewers reward candidates who say “we provide effectively-once observable behavior using idempotent sinks and dedupe keys.”
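
The “idempotent sinks and dedupe keys” phrase can be made concrete with a dedupe table written in the same transaction as the side effect. The sketch below uses SQLite from the Python standard library; the table names, the sent_counts counter, and the apply_once helper are hypothetical and exist only for illustration.

```python
import sqlite3


def open_sink() -> sqlite3.Connection:
    """Create an in-memory database with a dedupe table and a counter table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE sent_counts (guardian_id TEXT PRIMARY KEY, sent INTEGER NOT NULL)"
    )
    return conn


def apply_once(conn: sqlite3.Connection, message_id: str, guardian_id: str) -> bool:
    """Record one send for guardian_id unless message_id was already applied."""
    with conn:  # dedupe marker and counter update commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed (message_id) VALUES (?)", (message_id,)
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: the effect was already applied
        conn.execute(
            "INSERT OR IGNORE INTO sent_counts (guardian_id, sent) VALUES (?, 0)",
            (guardian_id,),
        )
        conn.execute(
            "UPDATE sent_counts SET sent = sent + 1 WHERE guardian_id = ?",
            (guardian_id,),
        )
        return True
```

This only gives effectively-once behavior inside the database boundary. An external SMS call cannot join that transaction, which is exactly the boundary the pitfall asks you to name.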

Pitfall: Designing only the happy path.

Many candidates describe producer → queue → worker → database and stop. A stronger answer walks through concrete failures: worker crashes after side effect but before offset commit, provider returns timeout but still sends SMS, one partition contains a poison message, or a migration snapshot races with CDC. These are the moments where delivery semantics actually matter.
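
The first failure in that list, a crash between the side effect and the offset commit, is easy to simulate. The following sketch is a toy model under stated assumptions: run_worker and simulate are hypothetical helpers, the log is a Python list, and the "crash" is modeled as an early return at a chosen offset.

```python
def run_worker(log, committed, sink, commit_first, crash_at=None):
    """Process log[committed:] and return the committed offset.

    crash_at names the offset at which the worker dies between its two steps.
    """
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1          # progress recorded first
            if offset == crash_at:
                return committed            # crash: side effect never happens
            sink.append(log[offset])
        else:
            sink.append(log[offset])        # side effect first
            if offset == crash_at:
                return committed            # crash: progress never recorded
            committed = offset + 1
        offset += 1
    return committed


def simulate(commit_first):
    """Run once with a crash at offset 1, then restart from the committed offset."""
    log, sink = ["a", "b", "c"], []
    committed = run_worker(log, 0, sink, commit_first, crash_at=1)
    run_worker(log, committed, sink, commit_first)
    return sink


print(simulate(commit_first=True))   # ['a', 'c']: record b is lost
print(simulate(commit_first=False))  # ['a', 'b', 'b', 'c']: record b is duplicated
```

Committing first yields at-most-once behavior and committing last yields at-least-once behavior; neither ordering removes the failure window, so the sink has to tolerate the duplicate.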

Pitfall: Over-constraining ordering and hurting scalability.

Global ordering sounds clean but often becomes a bottleneck. For most systems, the right question is “what entity requires ordered behavior?” Per-key ordering for a user, file, row, task DAG, or log partition is usually enough; global serialization should be justified by a hard product or correctness requirement.
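
Per-key ordering usually falls out of a stable partitioning function. The sketch below is illustrative only: partition_for and route are hypothetical helpers, and the SHA-256 based hash is a stand-in chosen because it is stable across processes, not the partitioner any particular broker uses.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions):
    """Append each (key, payload) event to its partition in arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions
```

Because every event for a given guardian_id lands in the same partition, a single consumer of that partition sees that guardian's events in arrival order, while different guardians are processed in parallel. Note that changing num_partitions remaps keys, so ordering guarantees need care during a resize.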

Connections

Interviewers may pivot from messaging into distributed transactions, consensus and replication, stream processing, or workflow orchestration. Be ready to discuss Raft-style leader replication, two-phase commit limitations, Dataflow/Flink windowing and checkpointing at a high level, and DAG execution patterns used by systems like Airflow or Temporal.
