Context
You work on a VR product that is introducing a new interaction feature called hand waving: the system detects when a user is waving their hand and triggers a corresponding UI interaction. Engineering believes the feature is ready to launch, and you, as the data scientist, are asked to evaluate launch readiness.
Task
- Define “hand-waving accuracy.”
  - What exactly is the unit of evaluation (frame, time window, gesture event, session)?
  - What labels/ground truth are needed, and how would you obtain them?
  - Which error types matter most (false positives vs. false negatives), and why?
  - What metrics would you report (e.g., precision/recall/F1, latency, stability), and how would you choose thresholds? (See the evaluation sketch below.)
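One way to make the event-level metrics concrete is sketched below (plain Python; the record layout, matching tolerance, and thresholds are assumptions, not the product's actual pipeline). It matches detections to labeled wave gestures within a time window and reports precision, recall, F1, and detection latency; threshold choice then becomes a sweep of the confidence cutoff over the same routine.

```python
# Minimal sketch: event-level evaluation of wave detection.
# Assumptions (not from the prompt): each labeled gesture is a start time in
# seconds, each detection is a (time_s, confidence) pair, and a detection is a
# true positive if it fires within `tol_s` seconds of a labeled gesture.

from statistics import median

def evaluate(labels, detections, conf_threshold, tol_s=0.5):
    """labels: list of gesture start times (s); detections: list of (time_s, confidence)."""
    fired = sorted(t for t, c in detections if c >= conf_threshold)
    unmatched_labels = sorted(labels)
    latencies, tp = [], 0

    for det_t in fired:
        # Greedily match each detection to the nearest unmatched label within tolerance.
        match = min(unmatched_labels, key=lambda l: abs(det_t - l), default=None)
        if match is not None and abs(det_t - match) <= tol_s:
            unmatched_labels.remove(match)
            latencies.append(det_t - match)
            tp += 1

    fp = len(fired) - tp        # detections with no labeled gesture nearby
    fn = len(unmatched_labels)  # labeled gestures the system never detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "median_latency_s": median(latencies) if latencies else None}

# Threshold choice: sweep the cutoff and pick the operating point that stays
# within the product's false-positive budget while keeping recall acceptable.
# for thr in (0.5, 0.6, 0.7, 0.8, 0.9):
#     print(thr, evaluate(labels, detections, thr))
```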
- Connect model/metric quality to product outcomes.
  - Propose a set of primary, diagnostic, and guardrail metrics that tie detection quality to user experience (e.g., engagement, retention, frustration, safety).
  - Describe potential confounders (novelty effects, device differences, power users vs. new users) and how you would handle them. (See the stratification sketch below.)
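For the confounders, one illustrative approach (a sketch assuming pandas and hypothetical column names) is to stratify the primary product metric by user tenure and device before comparing experiment arms, so novelty effects and device mix are visible rather than averaged away:

```python
# Minimal sketch: stratified readout of a primary metric.
# Column names and the file are assumptions, not actual product telemetry.

import pandas as pd

df = pd.read_parquet("experiment_user_days.parquet")  # hypothetical export
df["tenure_bucket"] = pd.cut(df["days_since_signup"],
                             bins=[0, 7, 30, 10_000],
                             labels=["new", "recent", "established"])

readout = (df.groupby(["arm", "device_model", "tenure_bucket"], observed=True)
             ["successful_wave_interactions_per_day"]
             .agg(["mean", "count"]))
print(readout)
```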
- Data collection and aggregation.
  - What events/logs would you instrument?
  - How would you aggregate the metrics (per event, per user/day, per session), and why? (See the logging and aggregation sketch below.)
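A sketch of what the instrumented events and a per-user/day aggregation could look like (all field names are illustrative, not an existing schema). Aggregating per user-day first keeps a handful of very heavy users from dominating fleet-level numbers.

```python
# Minimal sketch: one log record per detection attempt, aggregated per user-day.
# Every field name is illustrative; the real event schema would be agreed with eng.

from dataclasses import dataclass
from collections import defaultdict
from datetime import datetime

@dataclass
class WaveEvent:
    user_id: str
    ts: datetime
    device_model: str
    confidence: float
    fired: bool             # detector triggered the UI
    user_dismissed: bool    # proxy for a false positive the user noticed
    tracking_quality: float

def per_user_day(events):
    """Aggregate counts per (user, day) so heavy users don't dominate the metric."""
    agg = defaultdict(lambda: {"fired": 0, "dismissed": 0})
    for e in events:
        key = (e.user_id, e.ts.date())
        agg[key]["fired"] += e.fired
        agg[key]["dismissed"] += e.user_dismissed
    return agg
```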
- Diagnostics & visualization.
  - Name at least 2 diagnostic cuts (e.g., lighting, device type, hand-tracking quality, motion intensity) and the plots you’d use to identify failure modes. (See the slicing sketch below.)
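One way to produce such cuts, sketched with pandas and matplotlib under assumed column names: compute event-level precision and recall per slice and plot them as bars, which makes failure modes such as a recall drop in low light easy to spot.

```python
# Minimal sketch: precision/recall by diagnostic cut.
# Assumes an event-level evaluation table where each row carries a matched
# outcome ("tp", "fp", "fn") plus context columns; all names are illustrative.

import pandas as pd
import matplotlib.pyplot as plt

ev = pd.read_parquet("wave_eval_events.parquet")  # hypothetical export

def pr_by(ev, cut_col):
    counts = ev.pivot_table(index=cut_col, columns="outcome",
                            values="event_id", aggfunc="count", fill_value=0)
    counts = counts.reindex(columns=["tp", "fp", "fn"], fill_value=0)
    counts["precision"] = counts["tp"] / (counts["tp"] + counts["fp"])
    counts["recall"] = counts["tp"] / (counts["tp"] + counts["fn"])
    return counts[["precision", "recall"]]

for cut in ("device_model", "lighting_bucket"):
    pr_by(ev, cut).plot.bar(title=f"Wave detection quality by {cut}", ylim=(0, 1))
plt.show()
```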
- Launch recommendation & exec reporting.
  - What would you present to a VP-level audience as the single most important KPI (plus 2–3 supporting metrics)?
  - What is your launch decision framework (ship, ship-to-% ramp, delay), and what are the go/no-go criteria? (See the decision sketch below.)
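Finally, a sketch of how go/no-go criteria could be encoded so the ship / ramp / delay call is mechanical once the numbers are in; the specific metrics and thresholds below are placeholders to illustrate the framework, not recommendations.

```python
# Minimal sketch: launch decision as explicit criteria.
# Thresholds and metric names are placeholders, not recommendations.

def launch_decision(metrics):
    hard_blockers = (
        metrics["fp_rate_per_hour"] > 2.0                      # guardrail: spurious triggers
        or metrics["safety_incidents_per_1k_sessions"] > 0.0   # guardrail: safety
    )
    if hard_blockers:
        return "delay"
    meets_primary = (
        metrics["event_recall"] >= 0.90
        and metrics["event_precision"] >= 0.95
        and metrics["p95_latency_ms"] <= 300
    )
    return "ship" if meets_primary else "ramp_to_small_percent"

print(launch_decision({"fp_rate_per_hour": 0.4,
                       "safety_incidents_per_1k_sessions": 0.0,
                       "event_recall": 0.92,
                       "event_precision": 0.96,
                       "p95_latency_ms": 180}))
```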