PracHub

Design Camera Health Monitoring

Last updated: Jun 17, 2026

Quick Overview

This question evaluates a candidate's competency in designing scalable, near-real-time monitoring systems, covering high-throughput heartbeat ingestion, compact per-device state management, detection of missing events, and handling duplicates, late arrivals, and ordering issues.

Company: Verkada

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design a system that monitors the health of **10 million cameras** and continuously reports, in near real time, **how many devices are healthy and how many are unhealthy**.

Each camera sends one **heartbeat every minute**. A camera is considered **healthy** if the system has received at least one heartbeat from it within the **last 2 minutes**; otherwise it is **unhealthy**. The dashboard needs a live global count (and, where it helps, breakdowns), not per-camera time-series history.

Your design should address:

- The **heartbeat ingestion path** (how heartbeats are received, validated, and buffered).
- **Storage for device state** (what you keep per camera and where).
- How to **compute healthy and unhealthy counts** in near real time without rescanning all 10M devices on every tick.
- How to handle **duplicate, late, and out-of-order** heartbeats.
- **Scalability, fault tolerance, and monitoring.**

```hint Where to start
Do the back-of-the-envelope math first. $10^7 \text{ cameras} / 60\text{ s} \approx 167\text{K}$ heartbeats/sec, but one record per camera is only ~100 bytes, so the *entire* device state is roughly **1 GB**, which is small. The hard part is not storing state; it is avoiding a per-minute full scan of 10M rows just to recompute two numbers. Separate "ingest is high-throughput" from "state is tiny."
```

```hint Counting strategy
Re-counting two numbers from 10M rows every tick is the obvious approach. What does it cost, and how stale are the numbers between scans? Now ask the opposite question: in a typical minute, how *many* cameras actually change health state? If that number is tiny compared to the fleet, what would you have to react to (instead of re-scanning) to keep the counts current?
```

```hint The timeout problem
A heartbeat arriving is an event you can react to. But "this camera went 2 minutes with *no* heartbeat" is the *absence* of an event: nothing shows up to tell you. How do you detect something that is defined by nothing happening, for 10M cameras, without periodically sweeping all of them? What would you have to set up at the moment a heartbeat *does* arrive?
```

```hint Idempotency under retries
Retries mean the same heartbeat can arrive twice, late, or out of order, and you may have already scheduled a "this camera expired" check that a newer heartbeat has since made wrong. Which clock do you trust when the client's is unreliable? And when a possibly-stale expiry check eventually runs, what could you store per camera at heartbeat time that lets the check decide for itself whether it's still relevant?
```
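One way to approach the timeout and idempotency hints is a min-heap of scheduled expiry checks with lazy invalidation. The sketch below is only an illustration under stated assumptions: `ExpiryTracker`, `record_heartbeat`, and `pop_expired` are hypothetical names, timestamps are server receive times in whole seconds, and a camera is treated as expired once `now >= last_seen + 120` (the exact boundary is an assumption, since the question does not pin it down).

```python
import heapq

HEALTH_WINDOW_S = 120  # health window from the problem statement


class ExpiryTracker:
    """Hypothetical per-shard tracker: one scheduled check per accepted heartbeat."""

    def __init__(self):
        self.last_seen = {}    # camera_id -> latest accepted server receive time
        self.expiry_heap = []  # (deadline, camera_id), smallest deadline first

    def record_heartbeat(self, camera_id, server_ts):
        """Accept a heartbeat only if it is newer than what is already stored."""
        if server_ts <= self.last_seen.get(camera_id, float("-inf")):
            return False  # duplicate, late, or out-of-order: ignore
        self.last_seen[camera_id] = server_ts
        heapq.heappush(self.expiry_heap, (server_ts + HEALTH_WINDOW_S, camera_id))
        return True

    def pop_expired(self, now):
        """Return cameras whose most recent deadline has passed."""
        expired = []
        while self.expiry_heap and self.expiry_heap[0][0] <= now:
            deadline, camera_id = heapq.heappop(self.expiry_heap)
            # A newer heartbeat moved the deadline, so this check is stale.
            if self.last_seen[camera_id] + HEALTH_WINDOW_S == deadline:
                expired.append(camera_id)
        return expired
```

The stored `last_seen` value is what lets an old check decide for itself whether it still applies, so superseded checks never have to be cancelled. The cost is that the heap can hold more than one entry per camera until the stale ones are popped.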

Related Interview Questions

  • Design access control and heartbeat systems - Verkada (medium)
  • Design real-time per-status device counts - Verkada (medium)
  • Design camera access-control service - Verkada (hard)
Apr 2, 2026, 12:00 AM

Constraints & Assumptions

  • Population: ~10,000,000 provisioned cameras, tracked in a registry, so total_active is well-defined and unhealthy can be derived/cross-checked as total_active − healthy.
  • Cadence: 1 heartbeat/camera/minute ⇒ ~167K heartbeats/sec steady state. Provision for reconnect storms (post-deploy/network-blip retries) at roughly 3–5× steady state.
  • Health window: received a heartbeat within the last 120 s.
  • Freshness: an ops dashboard tolerates a few seconds (≈5–10 s) of count staleness; transactional exactness is not required.
  • Payload: a heartbeat is tiny (camera_id + timestamp + a few optional fields), ~50–100 B of useful data; the scaling pressure is request/connection rate, not bandwidth.
  • Hot keys: one camera = one heartbeat/min, so no single device can be a hot key (relevant to partitioning).
  • Scope: aggregate health counts only — not video ingestion, per-camera dashboards, or per-device alerting (note where the design extends if asked).
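The sizing figures above can be reproduced with a few lines of arithmetic. This is a rough sketch: the variable names are illustrative, and `BYTES_PER_RECORD = 100` is an assumption taken from the upper end of the stated payload range.

```python
CAMERAS = 10_000_000
HEARTBEAT_INTERVAL_S = 60
BYTES_PER_RECORD = 100      # assumed: upper end of the ~50-100 B range
STORM_MULTIPLIERS = (3, 5)  # reconnect-storm provisioning range from the constraints

# Steady-state request rate: every camera reports once per interval.
steady_rate = CAMERAS / HEARTBEAT_INTERVAL_S

# Total device state if one small record is kept per camera.
state_bytes = CAMERAS * BYTES_PER_RECORD

# Peak request rates to provision for during reconnect storms.
storm_rates = [m * steady_rate for m in STORM_MULTIPLIERS]

# Ingress bandwidth at steady state, to compare against the request rate.
ingress_mb_per_s = steady_rate * BYTES_PER_RECORD / 1e6

print(f"steady state: {steady_rate:,.0f} heartbeats/s")
print(f"device state: {state_bytes / 1e9:.1f} GB")
print(f"storm range:  {storm_rates[0]:,.0f} to {storm_rates[1]:,.0f} heartbeats/s")
print(f"ingress:      {ingress_mb_per_s:.1f} MB/s")
```

Under these assumptions the ingress bandwidth is only a few tens of MB/s while the request rate is in the hundreds of thousands per second, which is consistent with the point that request/connection rate, not bandwidth, drives the design.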

Clarifying Questions to Ask

  • How fresh must the counts be — sub-second, or is 5–10 s of lag acceptable? (Drives stream-aggregate vs. periodic rollup, and the consistency model.)
  • Do we need breakdowns by region / firmware / tenant, or only one global number? (Drives counter cardinality.)
  • Is the device population fixed and registered, or can unknown cameras appear? (Determines whether unhealthy = total_active − healthy is valid.)
  • What does "received a heartbeat" mean — server receive time, or the client-stamped time? (Client clocks are unreliable.)
  • What delivery guarantees does the transport give (at-least-once, ordering), and how large can a reconnect storm get (fraction of fleet, over what window)?
  • Are there per-tenant isolation, retention, or compliance requirements on the heartbeat data?

What a Strong Answer Covers

  • Sizing that drives design: correctly observes that state is small (~1 GB) but throughput is high (~167K/s), and that reconnect storms — not steady state — are the real driver; uses the numbers to make an architectural choice, not just report them.
  • Incremental counting: counts are maintained by reacting to state transitions, never by periodic full scans of 10M rows.
  • Timeout detection at scale: a concrete, non-scanning mechanism (timer wheel / expiration buckets / sorted set) co-located per shard, with a recovery story.
  • Ingestion path: stateless ingest behind an LB, fast acknowledgement, and a durable partitioned log decoupling ingest from processing and absorbing bursts.
  • Partitioning by camera_id: serialized per-camera updates; state, timers, and counters co-located on the owning shard (no cross-shard coordination for one device).
  • Idempotency & ordering: monotonic last_seen_ts / version and server-side timestamps make duplicates, late, and out-of-order events safe under at-least-once delivery.
  • Fault tolerance: log replication/replay, state recovery from a changelog or KV, counter reconciliation against authoritative state, and graceful behavior under reconnect storms.
  • Read path & monitoring: O(1) dashboard reads from a counts cache; instrumentation on consumer lag, transition rates, and the healthy + unhealthy = total_active invariant.
  • Explicit tradeoffs: names the consistency model chosen (eventual count, strong per-camera ordering) and justifies it for an ops dashboard.
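Several of the points above (incremental counting, expiration buckets, and monotonic last-seen timestamps) can be combined in one per-shard structure. The following is a minimal single-process sketch, not a reference design: `ShardHealthTracker` and its methods are hypothetical names, timestamps are assumed to be server receive times in whole seconds, `total_active` is assumed to come from the device registry, and `now` is assumed to come from the same clock as the heartbeat timestamps.

```python
from collections import defaultdict

HEALTH_WINDOW_S = 120  # health window from the problem statement


class ShardHealthTracker:
    """Hypothetical owner of state, timers, and counters for one shard."""

    def __init__(self, total_active):
        self.total_active = total_active  # assumed to come from the registry
        self.last_seen = {}               # camera_id -> latest server receive time
        self.is_healthy = {}              # camera_id -> current health flag
        self.buckets = defaultdict(set)   # deadline second -> camera_ids to check
        self.healthy_count = 0

    def on_heartbeat(self, camera_id, server_ts):
        prev = self.last_seen.get(camera_id)
        if prev is not None and server_ts <= prev:
            return  # duplicate or out-of-order event: no state change
        self.last_seen[camera_id] = server_ts
        self.buckets[server_ts + HEALTH_WINDOW_S].add(camera_id)
        if not self.is_healthy.get(camera_id, False):
            self.is_healthy[camera_id] = True  # unhealthy -> healthy transition
            self.healthy_count += 1

    def advance(self, now):
        """Expire due buckets; touches bucket keys, never the full camera set."""
        due = sorted(d for d in self.buckets if d <= now)
        for deadline in due:
            for camera_id in self.buckets.pop(deadline):
                if self.last_seen[camera_id] + HEALTH_WINDOW_S != deadline:
                    continue  # superseded by a newer heartbeat
                if self.is_healthy[camera_id]:
                    self.is_healthy[camera_id] = False  # healthy -> unhealthy
                    self.healthy_count -= 1

    def counts(self):
        """O(1) read: (healthy, unhealthy) with unhealthy derived from the total."""
        return self.healthy_count, self.total_active - self.healthy_count
```

A short usage example, with the values it produces:

```python
shard = ShardHealthTracker(total_active=3)
shard.on_heartbeat("a", 0)
shard.on_heartbeat("b", 0)
shard.on_heartbeat("a", 60)   # refreshes a's deadline to 180
shard.advance(120)            # b expires; a's old check is ignored
print(shard.counts())         # (1, 2)
```

The counter changes only on a state transition, so the work per tick is proportional to the number of cameras whose deadline falls in that tick rather than to the fleet size. Durability and recovery after a crash are left out of this sketch.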

Follow-up Questions

  • Extend this to report healthy/unhealthy counts broken down by region, tenant, or firmware version without exploding counter cardinality or the read path. What changes in your counters, rollup, and read path?
  • After a consumer crash and partition rebalance, how exactly does the new owner reconstruct in-memory state, the timer schedule, and the counters — and how do you prevent double-counting a transition during replay?
  • How do you distinguish "the cameras are actually unhealthy" (a real outage) from "our monitoring pipeline is unhealthy" (consumer lag making counts stale)? What signals and alerts separate the two?
  • The health threshold is "2 minutes" and the cadence is "once a minute." How would you make these configurable per tenant or device class (e.g. a 5-minute cadence, or a 30-second SLA) without redesigning the timeout machinery?
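For the breakdown follow-up, one possible direction is to keep a separate counter per (dimension, value) pair and update each of them on a state transition. The sketch below is an assumption-laden illustration: `DimensionalCounts` and its methods are hypothetical, the labels for a camera are assumed to be known at transition time, and a label change on a healthy camera (for example a firmware upgrade) is not handled.

```python
from collections import Counter


class DimensionalCounts:
    """Hypothetical marginal counters: one entry per (dimension, value)."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.healthy = Counter()  # (dimension, value) -> healthy cameras

    def apply_transition(self, labels, became_healthy):
        """Apply one health transition for a camera with the given labels."""
        delta = 1 if became_healthy else -1
        for dim in self.dimensions:
            self.healthy[(dim, labels[dim])] += delta

    def healthy_by(self, dim):
        """Healthy counts for one dimension, omitting values that are at zero."""
        return {v: n for (d, v), n in self.healthy.items() if d == dim and n}
```

With this layout the number of counters grows with the sum of distinct values across dimensions rather than their product. The tradeoff is that it answers "healthy by region" or "healthy by firmware" but not a combined region-and-firmware query, which would need its own counters.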
