Technical Fundamentals for Non-Technical Product Managers

What's being tested

Interviewers are probing whether you can translate technical constraints into product tradeoffs: prioritize features against reliability, estimate user impact from performance metrics, and communicate clear success criteria to engineers and stakeholders. They want a PM who understands observability, common scalability patterns, rollout safety (feature flags/canaries), and how those choices affect metrics like p95 or DAU. At DoorDash this matters because small latency or reliability regressions directly affect conversion, retention, and operations cost.

Core knowledge

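Latency vs Throughput — Latency is per-request time (report p50, p90, p99); throughput is requests per second. Optimizing one can worsen the other (batching, for example, raises throughput but adds per-request delay), so first decide which metric maps to user experience.
SLI / SLO / SLA — An SLI is a measured signal (e.g., the ratio of successful checkouts to attempts), an SLO is an internal target for that signal (e.g., 99.9% success), and an SLA is a contractual commitment to customers, usually with penalties if it is missed. PMs help set SLOs tied to business impact.
Error budget — The error budget is what the SLO leaves over: a 99.9% SLO allows 0.1% of requests to fail, or roughly 43 minutes of downtime in a 30-day month. Use it to decide between reliability work and launches, and spend it consciously during aggressive rollouts.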
Observability — Instrumentation must include metrics, logs, and traces; metrics detect issues, traces show latency sources, logs contain context for root cause analysis.
Caching & CDNs — Use cache layers (edge CDN or app Redis) to reduce origin load and latency for read-heavy endpoints; be explicit about TTL, invalidation, and staleness tolerance.
Datastore tradeoffs — Postgres (ACID) for strong consistency and complex queries; NoSQL for high-scale, partition-tolerant needs. Declare consistency needs before choosing storage patterns.
Retries and idempotency — Retries must use exponential backoff and require idempotent endpoints (or idempotency keys) to prevent duplicate side effects like double charges.
Rate limiting & throttling — Protect core systems by setting per-client and global limits; for PMs, choose user-facing behavior (reject vs queue) and fallback UX messaging.
Feature flags & rollout strategies — Use feature flag targeting, percentage rollouts, and canary deployments; pair with automatic rollback triggers tied to SLIs.
Incident & postmortem discipline — Track mean time to detect (MTTD) and mean time to recover (MTTR); PMs should own customer communication, prioritization, and follow-through on action items.
Cost vs performance — Quantify the tradeoff: caching lowers compute and origin load but adds memory cost; scaling horizontally to N replicas increases throughput roughly linearly until a downstream bottleneck (such as a shared database) appears.
Security & privacy basics — Classify data sensitivity, prefer encryption-in-transit and at-rest, and require least-privilege access; PMs must specify compliance constraints early.
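
To make the retries and idempotency point concrete, here is a minimal sketch. FlakyPaymentService is a hypothetical stand-in, not a real payment API, and the attempt count and delays are illustrative assumptions. It shows one idempotency key reused across retries, so a request whose response was lost cannot charge twice.

```python
import random
import uuid


class FlakyPaymentService:
    """Hypothetical payment API: records the charge, then loses the response twice."""

    def __init__(self, failures_before_success=2):
        self.failures_left = failures_before_success
        self.charges = {}  # idempotency key -> amount charged

    def charge(self, idempotency_key, amount_cents):
        # Record the charge at most once per key, so retries are harmless.
        first_time = idempotency_key not in self.charges
        if first_time:
            self.charges[idempotency_key] = amount_cents
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TimeoutError("response lost after the charge was recorded")
        return {"key": idempotency_key, "charged_now": first_time}


def charge_with_retries(service, amount_cents, max_attempts=4,
                        base_delay_s=0.1, sleep=lambda seconds: None):
    """Retry with exponential backoff and jitter, reusing one idempotency key.

    `sleep` is a no-op by default so the sketch runs instantly; a real client
    would pass time.sleep.
    """
    key = str(uuid.uuid4())  # one key for the whole logical operation
    delays = []
    for attempt in range(max_attempts):
        try:
            return service.charge(key, amount_cents), delays
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to base * 2^attempt.
            delay = random.uniform(0, base_delay_s * 2 ** attempt)
            delays.append(delay)
            sleep(delay)


service = FlakyPaymentService()
result, delays = charge_with_retries(service, 2500)
```

In an interview, the point to draw out is that the client retried twice yet the service holds exactly one charge, because the key, not the request, identifies the operation.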

Worked example

(Example problem: "Reduce p90 checkout latency by 30% for high-traffic markets")
Start by clarifying scope: define the exact metric (p90 over 7 days), segmentation (logged-in users vs guest), and acceptable customer impact during rollout. Frame the answer around three pillars: measurement, quick wins, and medium-term architecture changes.

Measurement: add instrumentation to break checkout into sub-spans (gateway, payment, inventory) so you can attribute latency.
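
As a rough illustration of what sub-span instrumentation means, the sketch below times each stage of a checkout separately. The span helper and the sleep durations are assumptions for illustration; a real system would more likely use a tracing library than a hand-rolled timer.

```python
import time
from contextlib import contextmanager


@contextmanager
def span(name, timings):
    """Record how long the wrapped block took, in milliseconds, under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0


def slowest_span(timings):
    """Return the name of the stage that contributed the most latency."""
    return max(timings, key=timings.get)


def checkout(timings):
    # time.sleep stands in for real work; the durations are made up.
    with span("gateway", timings):
        time.sleep(0.005)
    with span("payment", timings):
        time.sleep(0.03)
    with span("inventory", timings):
        time.sleep(0.01)


timings = {}
checkout(timings)
```

Once each request carries a breakdown like this, "checkout is slow" becomes "payment accounts for most of p90", which is a claim engineers can act on.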

Quick wins: enable a short cache for product availability and defer nonessential network calls (analytics) from the critical path.
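
A short cache for availability might look like the following sketch. The TTLCache class, the 5-second TTL, and the loader function are all illustrative assumptions; the point is that the TTL is the staleness the product is agreeing to tolerate.

```python
import time


class TTLCache:
    """Tiny time-to-live cache: serve a stored value until it expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self.store = {}     # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: skip the origin call
        value = loader(key)  # expired or missing: go to the origin
        self.store[key] = (value, now + self.ttl)
        return value


# Usage with a fake clock so the expiry is visible immediately.
fake_now = [0.0]
origin_calls = []


def load_availability(item_id):
    origin_calls.append(item_id)
    return "in_stock"


cache = TTLCache(ttl_seconds=5, clock=lambda: fake_now[0])
first = cache.get_or_load("item-1", load_availability)
second = cache.get_or_load("item-1", load_availability)  # served from cache
fake_now[0] = 6.0
third = cache.get_or_load("item-1", load_availability)   # TTL passed, reloads
```

The product question hiding in ttl_seconds is how long a customer may see an item as available after it has sold out.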

Medium-term: consider asynchronous payment confirmation with an optimistic UI, and pair the checkout SLO with an explicit error budget so experiments stay controlled. Tradeoff to flag: optimistic UX reduces visible latency but adds reconciliation complexity and potential support load. Close by proposing guardrails: a percentage rollout behind a feature flag, an automated rollback threshold on p90, and a 2-week monitoring window. With more time, you would run A/B tests measuring conversion lift and customer support volume.
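
The rollout guardrails can be sketched in a few lines. This is an assumption-laden illustration rather than any particular flag system's API: the flag name, the hashing scheme, and the 5% regression threshold are all hypothetical.

```python
import hashlib


def in_rollout(user_id, flag_name, percent):
    """Deterministically assign a user to one of 100 buckets for this flag.

    The same user always lands in the same bucket, so raising `percent`
    only adds users and never flips existing ones out of the feature.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


def should_roll_back(control_p90_ms, treatment_p90_ms, max_regression=0.05):
    """Trigger rollback if treatment p90 is more than max_regression worse than control."""
    return treatment_p90_ms > control_p90_ms * (1 + max_regression)


users = [f"user-{i}" for i in range(10_000)]
exposed_at_10 = [u for u in users if in_rollout(u, "async_payment", 10)]
exposed_at_50 = [u for u in users if in_rollout(u, "async_payment", 50)]
```

Agreeing on the threshold and on who owns the rollback before launch is the PM's part; the mechanism itself is small.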

A second angle

(Example problem: "Design a safe rollout plan for a new driver-tracking feature")
Here the same concepts apply but the emphasis shifts to privacy, telemetry volume, and real-time constraints. Start with instrumentation and an SLI (e.g., successful location updates per minute) and an SLO tied to restaurant ETA accuracy. Use a canary rollout: first internal drivers, then a small percentage of production, while monitoring p99 of location-processing latency and storage costs. Because telemetry volume can balloon, include sampling or downsampling decisions upfront and define retention policies. The core tradeoff is privacy vs fidelity (more frequent updates mean better ETAs but higher cost and a larger privacy surface); as the PM, decide the acceptable resolution and communicate it to legal and ops.
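
To show what a downsampling decision looks like in practice, here is a minimal sketch. The ping format and the 5-second interval are assumptions chosen for illustration, not values from any real tracking pipeline.

```python
def downsample(pings, min_interval_s):
    """Keep a ping only if at least min_interval_s has passed since the last kept ping.

    pings: list of (timestamp_s, lat, lon) tuples, sorted by timestamp.
    """
    kept = []
    last_ts = None
    for ping in pings:
        ts = ping[0]
        if last_ts is None or ts - last_ts >= min_interval_s:
            kept.append(ping)
            last_ts = ts
    return kept


# One minute of once-per-second pings from a single (made-up) driver.
raw_pings = [(t, 37.77 + t * 1e-5, -122.42) for t in range(60)]
stored_pings = downsample(raw_pings, min_interval_s=5)
```

Here a 5-second interval stores 12 of the 60 pings, which is the kind of number to bring to the conversation about storage cost, ETA accuracy, and location granularity.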

Common pitfalls

Pitfall: Confusing symptom with cause — blaming "API slowness" when the real issue is downstream DB contention; always instrument to isolate the layer before prescribing fixes.

Pitfall: Overfocusing on averages — using mean latency hides tail latency; communicate p90/p99 and user-facing percentiles tied to experience.
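
A small worked example makes the gap visible. The latency values below are invented, and the nearest-rank percentile used here is one common definition among several.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 85 fast requests and 15 slow ones (made-up numbers).
latencies_ms = [200] * 85 + [3000] * 15

mean_ms = statistics.mean(latencies_ms)  # 620
p50_ms = percentile(latencies_ms, 50)    # 200
p90_ms = percentile(latencies_ms, 90)    # 3000
p99_ms = percentile(latencies_ms, 99)    # 3000
```

The mean of 620 ms describes no actual request: most users see 200 ms and 15% see 3 seconds, which only the percentiles reveal.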

Pitfall: Launching without rollback criteria — shipping a change with no SLO-based rollback rule forces firefighting; define automated thresholds and ownership before rollout.
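
One way to express an SLO-based rollback rule is in terms of error budget burned. The sketch below is illustrative only: the 99.9% target, the request counts, and the pause and rollback thresholds are assumptions a team would set for itself.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    return (1 - slo_target) * total_requests


def budget_burned(slo_target, total_requests, failed_requests):
    """Fraction of the window's error budget already spent (1.0 means fully spent)."""
    return failed_requests / error_budget(slo_target, total_requests)


def rollout_decision(burned, pause_at=0.5, rollback_at=1.0):
    """Map budget burn to an action that was agreed on before the launch."""
    if burned >= rollback_at:
        return "rollback"
    if burned >= pause_at:
        return "pause"
    return "continue"


# A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
budget = error_budget(0.999, 1_000_000)
decisions = [
    rollout_decision(budget_burned(0.999, 1_000_000, failed))
    for failed in (250, 600, 1200)
]
```

With these assumed thresholds, 250 failures continues the rollout, 600 pauses it, and 1,200 rolls it back, so nobody has to argue about the decision during the incident.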

Connections

Interviewers may pivot into experimentation (how you'd A/B test a performance optimization), analytics (metric definitions and invariants), or ML product tradeoffs (latency vs model complexity for real-time recommendations). Be ready to connect SLOs to business metrics like conversion rate or retention.