What difficulty level is this interview question?

This is a medium-difficulty System Design question, commonly asked during Technical Screen rounds at Verkada.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Verkada during technical interviews.

Design Camera Health Monitoring | Verkada Interview Question

Q: Design Camera Health Monitoring

This question evaluates a candidate's competency in designing scalable, near-real-time monitoring systems, covering high-throughput heartbeat ingestion, compact per-device state management, detection of missing events, and handling duplicates, late arrivals, and ordering issues.

Design a system that monitors the health of 10 million cameras and continuously reports, in near real time, how many devices are healthy and how many are unhealthy.

Each camera sends one heartbeat every minute. A camera is considered healthy if the system has received at least one heartbeat from it within the last 2 minutes; otherwise it is unhealthy. The dashboard needs a live global count (and, where it helps, breakdowns), not per-camera time-series history.

Your design should address:

The heartbeat ingestion path (how heartbeats are received, validated, and buffered).
Storage for device state (what you keep per camera and where).
How to compute healthy and unhealthy counts in near real time without rescanning all 10M devices on every tick.
How to handle duplicate, late, and out-of-order heartbeats.
Scalability, fault tolerance, and monitoring.

Constraints & Assumptions

Population: ~10,000,000 provisioned cameras, tracked in a registry, so total_active is well-defined and unhealthy can be derived/cross-checked as total_active − healthy.
Cadence: 1 heartbeat/camera/minute ⇒ ~167K heartbeats/sec steady state. Provision for reconnect storms (post-deploy/network-blip retries) at roughly 3–5× steady state.
Health window: received a heartbeat within the last 120 s.
Freshness: an ops dashboard tolerates a few seconds (≈5–10 s) of count staleness; transactional exactness is not required.
Payload: a heartbeat is tiny (camera_id + timestamp + a few optional fields), ~50–100 B of useful data; the scaling pressure is request/connection rate, not bandwidth.
Hot keys: one camera = one heartbeat/min, so no single device can be a hot key (relevant to partitioning).
Scope: aggregate health counts only — not video ingestion, per-camera dashboards, or per-device alerting (note where the design extends if asked).
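
The sizing numbers above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the 100-byte figures for per-camera state and per-heartbeat payload are assumptions taken from the upper end of the ranges quoted in this question, and the variable names are illustrative.

```python
# Back-of-envelope sizing from the stated constraints.
CAMERAS = 10_000_000
HEARTBEAT_INTERVAL_S = 60
STORM_MULTIPLIERS = (3, 5)        # reconnect-storm headroom from the constraints
STATE_BYTES_PER_CAMERA = 100      # assumption: id, last_seen, status, a few tags
PAYLOAD_BYTES = 100               # assumption: upper end of the 50-100 B range

steady_rate = CAMERAS / HEARTBEAT_INTERVAL_S             # heartbeats per second
storm_rates = [steady_rate * m for m in STORM_MULTIPLIERS]
state_bytes = CAMERAS * STATE_BYTES_PER_CAMERA           # total device state
payload_bytes_per_s = steady_rate * PAYLOAD_BYTES        # steady-state payload bandwidth

print(f"steady state : {steady_rate:,.0f} heartbeats/s")
print(f"storm range  : {storm_rates[0]:,.0f} to {storm_rates[1]:,.0f} heartbeats/s")
print(f"device state : {state_bytes / 1e9:.1f} GB")
print(f"payload rate : {payload_bytes_per_s / 1e6:.1f} MB/s")
```

Under these assumptions the state fits in memory on a handful of nodes and the payload bandwidth is modest, which is why the request rate during a storm, rather than storage or bandwidth, ends up driving the ingestion design.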

Clarifying Questions to Ask

How fresh must the counts be — sub-second, or is 5–10 s of lag acceptable? (Drives stream-aggregate vs. periodic rollup, and the consistency model.)
Do we need breakdowns by region / firmware / tenant, or only one global number? (Drives counter cardinality.)
Is the device population fixed and registered, or can unknown cameras appear? (Determines whether unhealthy = total_active − healthy is valid.)
What does "received a heartbeat" mean — server receive time, or the client-stamped time? (Client clocks are unreliable.)
What delivery guarantees does the transport give (at-least-once, ordering), and how large can a reconnect storm get (fraction of fleet, over what window)?
Are there per-tenant isolation, retention, or compliance requirements on the heartbeat data?

What a Strong Answer Covers

Sizing that drives design: correctly observes that state is small (~1 GB) but throughput is high (~167K/s), and that reconnect storms — not steady state — are the real driver; uses the numbers to make an architectural choice, not just report them.
Incremental counting: counts are maintained by reacting to state transitions, never by periodic full scans of 10M rows.
Timeout detection at scale: a concrete, non-scanning mechanism (timer wheel / expiration buckets / sorted set) co-located per shard, with a recovery story.
Ingestion path: stateless ingest behind an LB, fast acknowledgement, and a durable partitioned log decoupling ingest from processing and absorbing bursts.
Partitioning by camera_id: serialized per-camera updates; state, timers, and counters co-located on the owning shard (no cross-shard coordination for one device).
Idempotency & ordering: monotonic last_seen_ts / version and server-side timestamps make duplicates, late, and out-of-order events safe under at-least-once delivery.
Fault tolerance: log replication/replay, state recovery from a changelog or KV, counter reconciliation against authoritative state, and graceful behavior under reconnect storms.
Read path & monitoring: O(1) dashboard reads from a counts cache; instrumentation on consumer lag, transition rates, and the healthy + unhealthy = total_active invariant.
Explicit tradeoffs: names the consistency model chosen (eventual count, strong per-camera ordering) and justifies it for an ops dashboard.
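
Several of the points above (incremental counting, non-scanning timeout detection, and idempotent handling of duplicates and out-of-order events) can be illustrated together in a small in-memory sketch of one shard. This is not a reference solution: the class and method names are hypothetical, timestamps are assumed to be server-side receive times in whole seconds, a camera is assumed to become unhealthy once now >= last_seen + 120, and durability, recovery, and the registry lookup for total_active are left out.

```python
from collections import defaultdict

HEALTH_WINDOW_S = 120


class ShardHealthTracker:
    """Health state for the cameras owned by one shard (illustrative names)."""

    def __init__(self, start_time: int, window_s: int = HEALTH_WINDOW_S) -> None:
        self.window_s = window_s
        self.clock = start_time                   # last second processed by advance()
        self.last_seen: dict[str, int] = {}       # camera_id -> newest receive time
        self.healthy: set[str] = set()            # cameras currently counted healthy
        # deadline second -> cameras to re-check at that second (expiration buckets)
        self.buckets: dict[int, set[str]] = defaultdict(set)

    def on_heartbeat(self, camera_id: str, received_at: int) -> None:
        prev = self.last_seen.get(camera_id)
        if prev is not None and received_at <= prev:
            return  # duplicate or out-of-order: a newer heartbeat is already recorded
        self.last_seen[camera_id] = received_at
        deadline = received_at + self.window_s
        if deadline <= self.clock:
            return  # arrived too late to make the camera healthy now
        self.healthy.add(camera_id)               # unhealthy -> healthy, or no-op
        self.buckets[deadline].add(camera_id)

    def advance(self, now: int) -> None:
        """Process only the buckets that came due; never scan every camera."""
        for t in range(self.clock + 1, now + 1):
            for camera_id in self.buckets.pop(t, ()):
                # Lazy check: skip entries superseded by a later heartbeat.
                if self.last_seen[camera_id] + self.window_s <= t:
                    self.healthy.discard(camera_id)   # healthy -> unhealthy
        self.clock = now

    def counts(self, total_active: int) -> tuple[int, int]:
        healthy = len(self.healthy)
        return healthy, total_active - healthy


tracker = ShardHealthTracker(start_time=0)
tracker.on_heartbeat("cam-a", 10)
tracker.on_heartbeat("cam-b", 20)
tracker.on_heartbeat("cam-a", 70)    # extends cam-a's deadline to 190
tracker.on_heartbeat("cam-a", 10)    # replayed duplicate: ignored
tracker.advance(140)                 # cam-b's deadline (140) has passed
print(tracker.counts(total_active=3))  # (1, 2): cam-b expired, a third camera never reported
```

The stale bucket entry left behind when cam-a sends a newer heartbeat is not deleted eagerly; it is discarded when its bucket comes due, which keeps the heartbeat path to a few dictionary operations. A production design would also need to persist last_seen and rebuild the buckets after a shard restart.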

Follow-up Questions

Extend this to report healthy/unhealthy counts broken down by region, tenant, or firmware version without exploding counter cardinality or the read path. What changes in your counters, rollup, and read path?
After a consumer crash and partition rebalance, how exactly does the new owner reconstruct in-memory state, the timer schedule, and the counters — and how do you prevent double-counting a transition during replay?
How do you distinguish "the cameras are actually unhealthy" (a real outage) from "our monitoring pipeline is unhealthy" (consumer lag making counts stale)? What signals and alerts separate the two?
The health threshold is "2 minutes" and the cadence is "once a minute." How would you make these configurable per tenant or device class (e.g. a 5-minute cadence, or a 30-second SLA) without redesigning the timeout machinery?
