Database Stability And SLOs

What's being tested

A strong Software Engineer should be able to reason about database stability as an operational property, not just “is the database up.” Interviewers are probing whether you can define meaningful SLIs, set realistic SLOs, detect degradation early, and connect symptoms like latency spikes, replication lag, lock contention, or disk saturation to concrete engineering actions. Bytedance cares because high-traffic systems depend on databases staying reliable under bursty load, regional failures, schema changes, and deploys. The expected answer is not DBA trivia; it is how a backend/system engineer designs, observes, and operates database-dependent services safely.

Core knowledge

Stability means the database continues serving correct reads/writes within expected latency, durability, and availability bounds under normal and stressed conditions. Assess it across availability, latency, throughput, error rate, replication lag, resource saturation, and data safety.
SLIs should be user-visible or service-visible, not only host-level. Good examples: read_success_rate, write_success_rate, p95/p99 query latency, transaction abort rate, connection pool wait time, deadlock count, replication lag seconds, backup restore success, and failover time.
SLOs turn SLIs into commitments over a window. For example: “99.95% of writes succeed within 200 ms over 30 days” or “replica lag remains below 5 seconds for 99.9% of minutes.” Availability is often computed as $\text{availability} = \frac{\text{successful requests}}{\text{total valid requests}}.$
Error budgets make reliability tradeoffs explicit. If the SLO is 99.9%, the monthly error budget is 0.1% of valid requests or time. A burn rate can be approximated as $\text{burn rate} = \frac{\text{current error rate}}{1 - \text{SLO}}.$ Fast burn alerts catch acute incidents; slow burn alerts catch chronic degradation.
Latency percentiles matter more than averages. Averages hide tail pain: p99 may spike because of lock waits, cache misses, checkpoint stalls, garbage collection, or noisy neighbors while mean latency looks fine. For user-facing services, p95 and p99 usually drive perceived reliability.
Capacity signals should cover both database internals and client pressure. Watch CPU, memory, disk IOPS, disk space, buffer cache hit ratio, active connections, connection pool saturation, queue depth, QPS, slow queries, row scans, index usage, lock waits, and transaction duration.
Scaling decisions depend on workload shape. Read-heavy systems may use read replicas, caching via Redis, or materialized views; write-heavy systems may need batching, partitioning, sharding, denormalization, or reducing transaction scope. Replicas do not solve primary write bottlenecks.
Replication improves availability and read scalability but introduces lag and consistency tradeoffs. With asynchronous replication, a failover can lose recent writes; with synchronous replication, write latency and availability may suffer. Track replication_lag_seconds, replica freshness, and promotion safety.
Backups are only useful if restores are tested. Stability assessment should include recovery point objective, RPO, and recovery time objective, RTO. For example, “RPO < 5 minutes, RTO < 30 minutes” implies continuous WAL/binlog archiving and regular restore drills.
Incident response should connect alerts to runbooks. A useful alert says what user impact is happening, likely causes, and first checks: recent deploys, traffic spikes, slow query logs, lock graphs, connection pools, disk saturation, replication lag, and failover status.
Schema and query changes are common instability sources. Dangerous changes include full-table scans, missing indexes, long migrations, blocking DDL, unbounded pagination, N+1 queries, and new high-cardinality indexes. Prefer online migrations, backfills in batches, feature flags, and query plans checked with EXPLAIN.
Graceful degradation protects the database during overload. Techniques include request throttling, circuit breakers, load shedding, bounded retries with exponential backoff and jitter, cache fallback, read-only mode, and idempotency keys for retried writes.

Worked example

For “How do you assess database system stability?”, a strong candidate first clarifies scope: “Are we assessing a single Postgres/MySQL instance, a replicated cluster, or a sharded service backing a user-facing product?” They should also ask whether the goal is ongoing monitoring, pre-launch readiness, or post-incident diagnosis, because each emphasizes different signals.

A clean answer can be organized around four pillars: service-level behavior, database health, data safety, and operational readiness. For service-level behavior, define SLIs such as read/write success rate, p95/p99 latency, timeout rate, and connection pool wait time, then compare them to SLOs over a rolling window. For database health, inspect CPU, memory, disk I/O, lock waits, transaction duration, slow queries, index hit rate, deadlocks, and saturation trends rather than only current utilization.

For data safety, mention replication health, backup freshness, restore testing, RPO/RTO, and whether failover has been exercised. For operational readiness, cover alert quality, runbooks, dashboards, on-call escalation, load testing, capacity forecasts, and safe deployment practices for schema/query changes.

One explicit tradeoff to flag: pushing for very tight p99 latency SLOs may require overprovisioning, aggressive caching, or simpler consistency guarantees, while looser SLOs may be acceptable for internal or asynchronous workloads. A strong close would be: “If I had more time, I’d compare recent SLO burn against deploy history and traffic growth, then run a restore/failover drill to validate that our reliability assumptions are real.”

A second angle

If the interviewer reframes the topic as “how would you improve an unstable database-backed service,” the answer should shift from assessment to prioritization. Start with the highest user-impacting SLI breach: for example, write timeouts or p99 latency, not generic CPU usage. Then isolate whether the bottleneck is client-side connection pooling, bad queries, lock contention, disk I/O, replication lag, or insufficient capacity. The same SLO framework still applies, but now it guides mitigation order: stop the bleeding with throttling or rollback, then fix root causes with indexing, query changes, partitioning, caching, or capacity changes. The best answers separate short-term mitigation from durable prevention.

Common pitfalls

Pitfall: Saying “I would monitor CPU, memory, and disk” and stopping there.

That answer is too infrastructure-centric. Better: start from user-visible SLIs like success rate and tail latency, then use host and database metrics to explain why those SLIs are degrading.

Pitfall: Treating backups, replication, and availability as the same thing.

Replication helps with failover and read scaling, but it can replicate corrupt writes or lag behind the primary. Backups protect against deletion, corruption, and disaster recovery, but only if restore procedures are tested against RPO/RTO targets.

Pitfall: Giving a list of tools without an operating model.

Mentioning Prometheus, Grafana, Datadog, or CloudWatch is not enough. Interviewers want to hear what you alert on, which thresholds are tied to SLOs, how you avoid noisy alerts, and what action an engineer takes when the alert fires.

Connections

This topic often pivots into distributed systems reliability, capacity planning, database indexing/query optimization, caching strategy, incident management, and consistency tradeoffs. Be ready to discuss how retries can amplify load, how read replicas affect consistency, and how schema migrations can destabilize production systems.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts