System Design Prompt: Distributed Metadata Catalog and Schema Registry
Context
Design a multi-tenant distributed Metadata Catalog and Schema Registry for an analytics platform. The service manages databases, schemas, tables, columns, views, roles, and grants. It must support high read throughput, transactional DDL updates, and change notifications to downstream systems (e.g., query planner, caches).
Begin by clarifying requirements (functional and non-functional), then design the system end to end.
Tasks
- Clarify requirements
  - Enumerate core use cases and access patterns.
  - Define SLAs/SLOs (latency, availability, durability, multi-region needs, data retention).
  - Specify consistency expectations and isolation levels.
  - Identify multi-tenancy constraints and per-tenant limits.
- External APIs
  - Propose REST/gRPC endpoints for CRUD on entities, conditional updates, transactions, and change-subscription.
  - Show request/response shapes at a high level, including idempotency and versioning.
- Data model
  - Define entities and relationships (normalized vs denormalized).
  - Describe versioning, soft deletes, and change-log design.
- Sharding and replication strategy
  - Explain partitioning keys and how to minimize cross-shard transactions.
  - Choose replication factor, placement (AZ/region), and read/write topology.
- Consistency model
  - Justify strong vs eventual consistency per operation type.
  - Describe transaction approach (single-shard vs multi-shard).
- Read/write paths
  - Describe end-to-end request flow for reads and writes, including cache interaction and index maintenance.
- Failure handling
  - Timeouts, retries, idempotency, partial-failure handling, dead letter policies.
  - Leader election and membership changes.
- Load management
  - Backpressure/admission control, hot-key mitigation, rate limiting.
  - Caching strategy (L1/L2, invalidation/signaling).
- Schema evolution
  - Compatibility rules, online migrations, and rolling upgrades.
- Observability
  - Key metrics, logs, tracing, and alerting tied to SLOs/error budgets.
- Security and compliance
  - Authentication, authorization (RBAC), encryption, audit logging, tenant isolation.
- Capacity planning
  - Provide a simple sizing model with example numbers (QPS, storage, replication overhead, headroom).
- High-level architecture and trade-offs
  - Present the component diagram verbally and discuss major trade-offs (latency vs availability, complexity vs robustness, cost vs performance).