How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is an easy-difficulty System Design question, commonly asked during Onsite rounds at Palantir.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Palantir during technical interviews.

Design a Server Metrics Monitor | Palantir Interview Question

Design a Server Metrics Monitor

Company: Palantir

Role: Software Engineer

Category: System Design

Difficulty: easy

Interview Round: Onsite

Design a **server metrics monitoring system** that periodically pulls metrics from a fleet of servers and makes them available for dashboards and alerting.

The system must contact each of **1,000 servers every 10 minutes**, collect metrics such as CPU usage, memory usage, disk usage, and application health, and store the results so they can be queried for dashboards and evaluated for alerts.

This is a centralized **pull-based** collector. The interview places special emphasis on the **worker that executes the metric-collection jobs**: expect to write the concurrent collection code, not just draw boxes. The design spans scheduling/orchestration, the concurrent worker pool, failure handling, storage and serving, and how it all scales as the fleet grows.

### Constraints & Assumptions

- **Fleet size:** 1,000 servers today; design should accommodate growth to 100,000+.
- **Collection cadence:** every 10 minutes per server (one "run" per interval).
- **Metrics per server:** a handful of gauges (CPU %, memory %, disk %, app health), tens to low-hundreds of bytes each.
- **Collection model:** pull; the monitor reaches out to a small HTTP/agent endpoint on each server.
- **Per-server call latency:** typically tens to low-hundreds of milliseconds; some servers are slow or unreachable.
- **Deadline:** a run must complete (or be abandoned) comfortably before the next 10-minute interval begins.
- **Availability target:** the monitor itself should survive a single process/host crash without losing the schedule or duplicating writes.
- Workloads are I/O-bound (network), not CPU-bound.

### Clarifying Questions to Ask

- What is the SLA on alert latency: how stale can a metric be before an alert fires (e.g. must a CPU spike alert within one interval)?
- Is the per-server endpoint a custom agent we control, a standard exporter (e.g. Prometheus-style `/metrics`), or SSH/SNMP? This shapes the call protocol and timeout budget.
- What is the retention requirement (raw points for N days, rolled-up aggregates beyond that)?
- How is the server inventory sourced and how often does it churn (autoscaling groups, manual registration, service discovery)?
- Do we need exactly-once semantics on writes, or is "at-least-once with idempotent writes" acceptable?
- Is multi-region collection required (collectors near the servers), or is a single region acceptable today?

### Part 1: Scheduling & Run Orchestration

Design how a collection **run** is created every 10 minutes and how its lifecycle is tracked. Address: who triggers the run, how a run is identified, how task state is recorded, and how you guarantee **exactly one run per interval** even with multiple scheduler replicas, i.e. how you prevent duplicate or overlapping runs.

```hint Where to start
Separate the *scheduler* (decides when, mints a `run_id`) from the *workers* (do the I/O). Anchor the `run_id` to the interval boundary, e.g. floor(now) to the 10-minute bucket, so the same interval always maps to the same identifier.
```

```hint Single-run guarantee
With more than one scheduler replica, two timers can fire at once. Reach for a coordination primitive that makes "create the run" a single-winner operation: a leader lease, or a uniqueness constraint on the run's interval key so the second insert simply fails.
```

```hint Overlap with the previous run
Decide a policy for when the prior run hasn't finished by the next tick: skip, cancel-and-replace, or run concurrently under distinct `run_id`s. Tracking per-run status (`running` / `completed` / `timed_out`) is what lets the scheduler make this call.
```

#### What This Part Should Cover

- Scheduler/worker separation and a deterministic, interval-anchored `run_id` scheme.
- A concrete mechanism for exactly-one-run-per-interval under replica concurrency (leader election or DB uniqueness), not just "use a lock."
- An explicit overlap policy plus a run-status state machine that makes it enforceable.

### Part 2: Worker Pool & Concurrent Execution (core)

This is the focus of the interview. **Write the worker that executes a run**: it is handed the run's targets and must fetch metrics from each server concurrently using a **bounded** worker pool (thread pool or async I/O). Show the actual concurrency structure: task distribution to workers, the bounded queue, and clean shutdown when the work is done or the run deadline is hit.

```hint Concurrency model
The work is network-bound, so concurrency >> core count is fine. Bound it with a fixed pool of ~50–200 workers (or async with a semaphore), each pulling targets off a shared queue, not one thread per server.
```

```hint Bounding & shutdown
Use a *bounded* queue and a hard run deadline. Workers loop "pull a target → collect → record" until the queue drains or the deadline passes; a poison-pill / drained-queue signal plus a deadline check gives clean termination instead of a hang on a slow server.
```

```hint Don't let one server stall the pool
Per-request timeouts are what keep a single hung server from pinning a worker for the whole run. Combine the per-call timeout (Part 3) with the run deadline so the pool always makes forward progress.
```

#### What This Part Should Cover

- A bounded pool (thread pool / async + semaphore) sized for I/O concurrency, with a shared bounded work queue, not unbounded fan-out or thread-per-server.
- Concrete, runnable worker-loop pseudocode: dequeue target, fetch with timeout, write result, record task status, repeat.
- Deterministic termination: drain-and-exit on completion and a deadline-driven cutoff, with graceful shutdown.
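A minimal sketch of the Part 2 worker loop in Python, offered as one possible shape rather than a reference answer. It assumes a hypothetical `fetch(server_id, timeout_s)` callable that performs the actual HTTP/agent call and raises `TimeoutError` when the per-request timeout expires. The names `run_collection`, `TaskResult`, and the `not_attempted` status are illustrative and do not come from the question. The sketch shows a fixed pool of threads pulling from a bounded queue, one sentinel per worker for clean shutdown, and a deadline check that skips queued targets once the run is out of time.

```python
import queue
import threading
import time
from typing import Callable, Dict, List, NamedTuple, Optional


class TaskResult(NamedTuple):
    server_id: str
    status: str  # "success", "timeout", "error", or "not_attempted" (assumed name)
    metrics: Optional[Dict[str, float]] = None
    error: Optional[str] = None


_STOP = object()  # sentinel ("poison pill") that tells a worker to exit


def run_collection(
    targets: List[str],
    fetch: Callable[[str, float], Dict[str, float]],  # hypothetical collector call
    pool_size: int = 50,
    run_deadline_s: float = 480.0,
    call_timeout_s: float = 5.0,
) -> List[TaskResult]:
    deadline = time.monotonic() + run_deadline_s
    work = queue.Queue(maxsize=2 * pool_size)  # bounded: producer blocks when full
    results: Dict[str, TaskResult] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = work.get()
            if item is _STOP:
                return
            if time.monotonic() >= deadline:
                continue  # past the deadline: drain without calling the server
            try:
                result = TaskResult(item, "success", fetch(item, call_timeout_s))
            except TimeoutError as exc:
                result = TaskResult(item, "timeout", error=repr(exc))
            except Exception as exc:  # one bad server must not kill the worker
                result = TaskResult(item, "error", error=repr(exc))
            with lock:
                results[item] = result

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(pool_size)]
    for t in threads:
        t.start()
    for server_id in targets:
        work.put(server_id)
    for _ in threads:
        work.put(_STOP)  # one sentinel per worker
    for t in threads:
        t.join()

    # Targets never fetched are reported explicitly instead of silently dropped.
    return [results.get(s, TaskResult(s, "not_attempted")) for s in targets]
```

The deadline check does not interrupt a call that is already in flight; the sketch relies on `fetch` honoring `call_timeout_s`, so the run can overshoot its deadline by at most one per-call timeout.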
### Part 3: Reliability (Timeouts, Retries, Partial Failures, Slow Servers)

Specify how a single collection task behaves under failure: per-request **timeouts**, a **retry** policy, how **partial failures** (some servers fail) are surfaced rather than failing the whole run, and how **slow servers** are isolated. Define how a failed collection is recorded distinctly from a healthy reading.

```hint Retry discipline
Retry only *transient* failures (timeout, connection reset), cap attempts (2–3), and back off with **jitter** to avoid synchronized retry storms. Stop retrying once the run deadline is near, since a late metric is worthless.
```

```hint Failure as data
A server that didn't answer is not a server reporting 0% CPU. Persist task status (`success` / `timeout` / `error`, attempt count, latency) separately from the metric values so "missing" is a first-class, alertable state.
```

#### What This Part Should Cover

- Per-request timeout budget tied to the cadence, and a bounded retry policy with exponential backoff + jitter, gated by the run deadline.
- Partial-failure handling: one bad server degrades the run, never fails it; clear `completed_with_errors` outcome.
- Explicit distinction between *missing/failed* collection and a *healthy zero* metric, recorded in a task-status store.

### Part 4: Storage, Querying, Dashboards & Alerting

Design the storage and serving layer. Cover the data model for metrics (and for run/task metadata), the choice of store, how dashboards query it, and how alert rules are evaluated, including alerting on *missing* data and on collection-failure rates, not just on metric thresholds.

```hint Store choice & schema
Time-stamped numeric gauges per `(server_id, metric_name)` are the canonical fit for a time-series store (downsampling, retention tiers, range queries). Tag points with region/cluster/env so dashboards can slice without scanning.
```

```hint Alert on absence
Threshold rules (CPU > 90% for 5 min) are the easy case. The harder, more valuable rule is "no data from server X for ≥ 2 intervals", which only works because Part 3 recorded missing-ness as data.
```

#### What This Part Should Cover

- A time-series data model for metric points (`server_id`, `metric_name`, `value`, `timestamp`, tags) and a separate run/task-status model.
- Justified store choice (time-series DB) and how dashboards query ranges/aggregates efficiently.
- Alerting that spans both metric thresholds and operational health (missing data, failure-rate), sourced from stored metrics + task status rather than worker logs.

### Part 5: Scaling the Fleet

Explain how the design evolves as the fleet grows from 1,000 to 100,000+ servers. Address sharding work across multiple collectors, durability of the work queue, partitioning strategy, autoscaling signals, and when to reconsider pull vs. push.

```hint Where the bottleneck moves
At 1k servers a single collector with an in-process queue is fine. As you scale, the queue must become durable (survive crashes) and work must shard across collector instances: partition by consistent hash of `server_id` or by region/cluster so each collector owns a slice.
```

```hint Pull's ceiling
A central pull model has a fan-out limit. Past some scale, flip to **push** (agents on each server push to an ingestion endpoint) or a hybrid (agents push metrics, central system still does liveness checks). Idempotent writes are what keep retries/duplicate tasks safe across any of these.
```

#### What This Part Should Cover

- Horizontal scaling: durable distributed queue, multiple collectors, and an explicit partitioning/sharding key.
- Autoscaling driven by real signals (queue depth, collection latency) and graceful rebalancing when collectors join/leave.
- A reasoned pull-to-push (or hybrid) transition with idempotent writes preserved throughout.
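The "partition by consistent hash of `server_id`" idea from Part 5 can be sketched as a small hash ring. This is an illustrative Python sketch under stated assumptions: the class name `HashRing`, the `owner` method, the collector names, and the choice of 100 virtual nodes per collector are all hypothetical and not part of the question. It shows why consistent hashing gives graceful rebalancing: when a collector joins, the only servers that change owner are the ones the new collector takes over.

```python
import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps each server_id to the collector that owns it."""

    def __init__(self, collectors: List[str], vnodes: int = 100) -> None:
        # Each collector is placed on the ring at `vnodes` pseudo-random points
        # so that load is spread roughly evenly.
        self._points = sorted(
            (_hash(f"{name}#{i}"), name) for name in collectors for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owner(self, server_id: str) -> str:
        # Walk clockwise to the first ring point after the server's hash,
        # wrapping around at the end of the ring.
        idx = bisect.bisect_right(self._keys, _hash(server_id)) % len(self._keys)
        return self._points[idx][1]


if __name__ == "__main__":
    servers = [f"srv-{i}" for i in range(1000)]
    before = HashRing(["collector-a", "collector-b", "collector-c"])
    after = HashRing(["collector-a", "collector-b", "collector-c", "collector-d"])
    moved = [s for s in servers if before.owner(s) != after.owner(s)]
    print(f"{len(moved)} of {len(servers)} servers changed owner")
```

With a plain `hash(server_id) % num_collectors` scheme, adding a fourth collector would reassign most servers. The ring keeps reassignment limited to the new collector's share, which matters once each collector holds per-server state such as connections or retry history.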
### What a Strong Answer Covers

Across all parts, a strong candidate frames this not as a throughput problem (1,000 small calls / 10 min is tiny) but as an **orchestration and reliability** problem, and keeps these cross-cutting threads consistent end-to-end:

- **Idempotency everywhere:** writes keyed so retries and duplicate tasks are safe, which is what makes the at-least-once queue, retries, and overlap policy all sound.
- **Bounded resource use:** bounded pool, bounded queue, hard deadlines, so no run can exhaust resources or run past its interval.
- **Failure isolation:** one slow or dead server degrades a run, never blocks or fails it.
- **Missing ≠ zero:** the distinction between a failed collection and a healthy zero is maintained from the worker through to alerting.
- **Pragmatism for the stated scale:** simple solution sized for 1,000 servers, with a clear, justified evolution path to large scale rather than over-engineering up front.

### Follow-up Questions

- Walk through exactly what happens to an in-flight run if the single collector process crashes at minute 3 of a 10-minute interval. What is lost, what is recovered, and how (ties to Parts 1–2)?
- Two scheduler replicas both believe they are leader for one interval (a split-brain / clock-skew window). What concretely prevents duplicate writes, and what is the worst observable outcome?
- A downstream dependency makes 30% of servers respond in 8 seconds instead of 100 ms. How does your Part 2/3 design keep the run within deadline, and what does the dashboard show for the slow servers?
- How would you add metric **down-sampling / rollups** so a year of history is queryable cheaply, and where in the pipeline does that aggregation run?
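For the split-brain follow-up, one possible answer combines the interval-anchored `run_id` from Part 1 with uniqueness constraints in the store. The sketch below is hedged accordingly. It uses Python's built-in `sqlite3` purely as a stand-in for whatever database the design picks. The table names (`runs`, `points`) and the functions `run_id_for`, `try_create_run`, and `write_point` are hypothetical. The first scheduler to insert the interval's row wins and the second gets a constraint violation. Metric writes are keyed by `(run_id, server_id, metric)`, so a duplicate task overwrites its point instead of adding a second one.

```python
import sqlite3

INTERVAL_S = 600  # 10-minute cadence from the problem statement


def run_id_for(ts: float) -> int:
    """Start of the 10-minute bucket containing `ts` (epoch seconds)."""
    return int(ts) // INTERVAL_S * INTERVAL_S


def init_schema(db: sqlite3.Connection) -> None:
    db.executescript(
        """
        CREATE TABLE runs (
            run_id INTEGER PRIMARY KEY,   -- one row per interval, enforced by the key
            status TEXT NOT NULL
        );
        CREATE TABLE points (
            run_id    INTEGER NOT NULL,
            server_id TEXT    NOT NULL,
            metric    TEXT    NOT NULL,
            value     REAL    NOT NULL,
            PRIMARY KEY (run_id, server_id, metric)
        );
        """
    )


def try_create_run(db: sqlite3.Connection, ts: float) -> bool:
    """Return True only for the caller whose insert wins this interval."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO runs (run_id, status) VALUES (?, 'running')",
                (run_id_for(ts),),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # another scheduler replica already created the run


def write_point(db: sqlite3.Connection, run_id: int, server_id: str,
                metric: str, value: float) -> None:
    """Idempotent write: repeating it leaves exactly one row per key."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO points (run_id, server_id, metric, value) "
            "VALUES (?, ?, ?, ?)",
            (run_id, server_id, metric, value),
        )
```

Under these assumptions, the worst observable outcome of two would-be leaders is one rejected insert. If both still dispatch tasks, the store holds a single point per server and metric for that interval, with the later write winning.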

Quick Answer: This question evaluates knowledge of distributed system design, scheduling and orchestration, concurrency, fault tolerance, and observability in the context of large-scale metric collection.

