Modeling Without Labels: End-to-End Plan
You are tasked with shipping an ML model but have no labeled data. Outline a rigorous approach to:
- Define the label and guard against leakage.
- Collect or create labels ethically and at scale.
- Validate label quality and maintain it over time.
Discuss the following components concretely:
- Instrumentation and logging schemas: event taxonomy, schema/versioning, user/session IDs, consent/PII handling, feature–label joins, time horizons.
- Heuristic/weak supervision and programmatic labeling: labeling functions, noise-aware aggregation, calibration.
- Human-in-the-loop pipelines: active learning, rater training, QA, throughput, costs.
- Proxy labels: when to use, known biases, calibration to true outcomes.
- Controlled experiments or exploration to elicit outcomes: A/B tests or bandits to ethically gather ground truth with minimal regret.
- Sampling strategies to reduce bias: stratification, reweighting, handling delayed feedback and censoring.
- Gold sets and inter-rater agreement: creation, maintenance, and agreement statistics.
- Continuous data quality monitoring: drift, label delay, schema contracts, alerts.
Provide a step-by-step plan, clear assumptions, and practical validation methods.
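As one concrete example of the agreement statistics mentioned under gold sets, a practical validation method could start from Cohen's kappa for two raters labeling the same items. The sketch below is illustrative only: the function name `cohens_kappa` and the sample labels are assumptions made up for this example, and a real pipeline with more than two raters or missing ratings would need a different statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items in the same order.

    Hypothetical helper for illustration; not part of the task statement.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label if each
    # labeled independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both raters always use one identical label, chance agreement is 1
    # and kappa is undefined; report perfect agreement in that edge case.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Made-up labels for eight items, purely to show the call shape.
labels_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
labels_b = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham"]
kappa = cohens_kappa(labels_a, labels_b)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one class dominates: two raters who both label almost everything as the majority class can show high raw agreement yet a kappa near zero.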