Event Instrumentation And Measurement Design — Tech Interview Concept

What's being tested
Ability to design reliable, debuggable telemetry for product measurement: choosing metrics/events, schema and sampling tradeoffs, handling loss/duplication/latency, and protecting privacy while enabling experiments.

Core knowledge

Define numerator/denominator and unit-of-analysis (user, session, device) before choosing events or aggregations.
Event schema: stable event_id, user_id (hashed), client_ts and server_ts, event_type, version, and typed properties.
Ingestion semantics: at-least-once vs exactly-once, dedup via event_id, idempotent writes, and watermark/late-arrival handling.
Sampling strategies: uniform, stratified, deterministic hashing; record the sampling rate with the data so counts can be reweighted, and account for sampling in variance estimates and metric backfills.
Cardinality limits: avoid high-cardinality keys (full URLs, free-text) that explode joins and storage.
Privacy/compliance: PII hashing/salting, PII removal at the SDK, consent flags, and retention/expiry policies.
Pipeline tooling/tradeoffs: Kafka for high-throughput streams, Flink/Spark for streaming aggregation, columnar storage for aggregates (e.g., a warehouse such as BigQuery, or Parquet files); monitor telemetry loss and pipeline SLAs.
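
The schema, dedup, and sampling points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the Event fields mirror the schema listed above, while the names in_sample, dedup, estimate_total, and the "sample-v1" salt are hypothetical. The in-memory set used for dedup stands in for whatever keyed state a real stream job would use.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Event:
    event_id: str      # generated once on the client and reused on retries
    user_id: str       # assumed to be hashed before it reaches this code
    event_type: str
    client_ts: float   # seconds since epoch, client clock
    server_ts: float   # seconds since epoch, assigned at ingestion
    version: int = 1
    properties: dict = field(default_factory=dict)


def in_sample(user_id: str, rate: float, salt: str = "sample-v1") -> bool:
    """Deterministic sampling: the same user_id always gets the same decision."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # 64-bit integer bucket
    return bucket < int(rate * 2**64)            # rate=0.0 keeps none, 1.0 keeps all


def dedup(events: Iterable[Event]) -> Iterator[Event]:
    """Keep the first event seen for each event_id (at-least-once delivery)."""
    seen = set()
    for event in events:
        if event.event_id in seen:
            continue
        seen.add(event.event_id)
        yield event


def estimate_total(sampled_count: int, rate: float) -> float:
    """Reweight a sampled count back to the full population (inverse-rate weight)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return sampled_count / rate
```

Hashing with a salt keeps the sampling decision stable across sessions and devices for a given user_id, and changing the salt reshuffles the sample without touching the events themselves.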

Worked example — typical interview: "Design instrumentation to measure a new feature's conversion"
Start by clarifying the primary metric (e.g., conversion rate = converted users / exposed users) and the unit of analysis (per user, counting only the first conversion, or per session). Identify the required events: exposure (impression) and conversion, each with a stable event_id, hashed user_id, client_ts, server_ts, feature_flag_id, and assignment token. State the sampling plan (deterministic hashing of user_id so a user's treatment stays consistent) and the guardrail metrics (DAU, error rate, page load time). Outline the pipeline: the client emits events to Kafka with retries and idempotency; a stream job deduplicates by event_id, applies sampling weights, joins exposure→conversion within a time window, and writes daily aggregates for analysis. Note how you’ll validate (end-to-end tests, synthetic events) and monitor (ingestion loss, observed event rate vs. expected).
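
As a rough batch-style sketch of the dedup and exposure→conversion join described above, the Python below computes a user-level conversion rate. Several choices here are assumptions made for illustration: the Ev tuple, the conversion_rate function, the 7-day default window, anchoring the window at each user's first exposure, and using server_ts for ordering. A streaming job would hold the same state per key and would also need late-arrival handling, which is omitted.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Ev(NamedTuple):
    event_id: str
    user_id: str      # hashed upstream (assumption)
    event_type: str   # "exposure" or "conversion"
    server_ts: float  # seconds since epoch


def conversion_rate(
    events: Iterable[Ev], window_s: float = 7 * 24 * 3600
) -> Tuple[int, int, float]:
    """Return (converted_users, exposed_users, rate) at user level."""
    seen_ids = set()
    first_exposure: Dict[str, float] = {}
    conversions: Dict[str, List[float]] = {}

    for e in events:
        if e.event_id in seen_ids:       # drop retried/duplicated deliveries
            continue
        seen_ids.add(e.event_id)
        if e.event_type == "exposure":
            prev = first_exposure.get(e.user_id)
            if prev is None or e.server_ts < prev:
                first_exposure[e.user_id] = e.server_ts
        elif e.event_type == "conversion":
            conversions.setdefault(e.user_id, []).append(e.server_ts)

    # A user converts if any conversion lands inside the window that
    # starts at their first exposure; unexposed users are never counted.
    converted = sum(
        1
        for uid, t0 in first_exposure.items()
        if any(t0 <= t <= t0 + window_s for t in conversions.get(uid, []))
    )
    exposed = len(first_exposure)
    rate = converted / exposed if exposed else 0.0
    return converted, exposed, rate
```

The denominator is the set of exposed users, so conversions from users who were never exposed, or that happened before exposure, do not inflate the rate.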

A common pitfall
Focusing on raw event counts instead of defining denominators and the unit of analysis leads to misleading metrics (e.g., counting clicks rather than unique users who clicked). Another tempting mistake is ignoring telemetry loss and deduplication: assuming client events are complete will bias rates if the mobile SDK drops events or sampling is inconsistent. Also avoid adding high-cardinality properties by default, because they break joins and inflate storage and cost.
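
The clicks-versus-users pitfall is easy to see with toy numbers. The data and function names below are made up purely for illustration.

```python
from typing import List, Set, Tuple


def clicks_per_exposed_user(clicks: List[Tuple[str, str]], exposed: Set[str]) -> float:
    """Raw event count over users: one heavy clicker can dominate this."""
    return len(clicks) / len(exposed)


def user_click_rate(clicks: List[Tuple[str, str]], exposed: Set[str]) -> float:
    """Share of exposed users with at least one click (user-level metric)."""
    clicked_users = {user_id for user_id, _ in clicks}
    return len(clicked_users & exposed) / len(exposed)


# (user_id, event_id) pairs: u1 clicks three times, u2 once, u3 and u4 never.
clicks = [("u1", "c1"), ("u1", "c2"), ("u1", "c3"), ("u2", "c4")]
exposed = {"u1", "u2", "u3", "u4"}

print(clicks_per_exposed_user(clicks, exposed))  # 1.0
print(user_click_rate(clicks, exposed))          # 0.5
```

Both numbers are computed from the same events, but only the second answers the question "what fraction of exposed users clicked?".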

Related concepts