Instrumentation, Logging, Labeling, And Data Quality
Asked of: Data Scientist

What's being tested
Interviewers are probing whether you can reason from a product or ML question back to the data-generating process: what should be logged, how labels are created, what can go wrong, and whether the resulting dataset is trustworthy enough for analysis or modeling. Meta cares because product decisions, ranking systems, integrity classifiers, and experiment readouts all depend on high-volume behavioral logs that are imperfect by default. The skill is not “knowing SQL”; it is being able to identify missing events, biased labels, duplicate records, delayed pipelines, privacy constraints, and metric definitions that could invalidate a conclusion. A strong Data Scientist can design instrumentation that supports causal analysis, debugging, monitoring, and long-term product learning.
Core knowledge
- Start from the decision, not the event. Define the product decision or model target first, then derive required entities, actions, timestamps, actors, surfaces, and context. For example, “improve Reels retention” needs impressions, starts, watch duration, skips, replays, follows, hides, comments, and session boundaries, not just “video_played.”
- Good event schemas are explicit and stable. A robust log includes event_name, event_id, user_id, device_id, session_id, surface, object_id, client_timestamp, server_timestamp, app_version, experiment_ids, and request/context metadata. Use schema versioning; never silently change field meaning.
- Client-side and server-side logs have different failure modes. Client logs capture user interactions like scrolls and taps but suffer from offline buffering, ad blockers, app crashes, clock skew, and duplicate sends. Server logs are more reliable for deliveries, ranking requests, payments, and notifications, but may miss user-visible client behavior.
- Idempotency and deduplication are essential at Meta scale. Use globally unique event_ids or deterministic composite keys such as (user_id, session_id, event_name, object_id, timestamp_bucket). Stripe’s idempotency-key pattern is the right mental model. Deduplicate before aggregating; otherwise retries inflate engagement.
- Late and out-of-order data are normal. Streaming systems like Kafka/Flink/Samza use event-time windows and watermarks; batch systems like Hive/Spark often backfill partitions. Track freshness SLAs such as “99% of mobile events arrive within 6 hours,” and avoid reading incomplete partitions in experiment analysis.
- Metric definitions must specify numerator, denominator, unit, and window. “Click-through rate” could be computed at the event, user, session, or query level. These choices change variance, interpretation, and susceptibility to bot or power-user behavior.
- Sampling must be designed, logged, and corrected. If only 1% of impressions are logged, record the sampling probability $p_i$ with each event and use inverse probability weights $w_i = 1/p_i$ when aggregating. Uniform sampling is simple; stratified sampling is better for rare events like reports, purchases, or harmful content.
- Approximate algorithms are common for large-scale logging. Exact distinct counts are fine up to roughly millions of IDs in memory-constrained jobs, but for billions use HyperLogLog, with relative standard error of approximately $1.04/\sqrt{m}$ for $m$ registers. Bloom filters trade false positives for memory efficiency in membership checks.
- Data quality checks should be automated and layered. Use tests like non-null constraints, enum validation, uniqueness checks, volume anomaly detection, distribution drift, referential integrity, and freshness checks. Tools such as Great Expectations, AWS Deequ, or internal equivalents prevent silent metric corruption.
- Labels are measurements, not ground truth. Human labels contain ambiguity, fatigue, policy interpretation differences, and sampling bias. Track annotator ID, policy version, confidence, timestamp, adjudication status, and content context. Measure agreement with Cohen’s κ or Krippendorff’s α, not raw agreement alone.
- Training labels and evaluation labels should be separated. A model trained on weak labels like user reports may learn reporter bias rather than true harm. Reserve a high-quality, policy-adjudicated gold set for evaluation; use active learning or hard-negative mining carefully to avoid making the eval set unrepresentative.
- Privacy and compliance constraints shape instrumentation. Minimize collection, avoid unnecessary PII, respect consent and regional rules, and define retention policies. At Meta, logs may need aggregation, access control, differential privacy, or purpose limitation before being usable for analysis or ML.
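The deduplication point above can be made concrete with a minimal sketch. It assumes events arrive as plain dictionaries using the field names from the schema bullet. The 5-second bucket width and the helper names `dedup_key` and `deduplicate` are illustrative assumptions, not a description of any production pipeline.

```python
from typing import Iterable, Iterator

BUCKET_SECONDS = 5  # assumed retry window; an illustrative value only


def dedup_key(event: dict) -> tuple:
    """Prefer the globally unique event_id; otherwise build a composite key."""
    if event.get("event_id"):
        return ("id", event["event_id"])
    # Fallback: bucket the timestamp so near-simultaneous retries collide.
    bucket = int(event["client_timestamp"]) // BUCKET_SECONDS
    return (
        "composite",
        event["user_id"],
        event["session_id"],
        event["event_name"],
        event["object_id"],
        bucket,
    )


def deduplicate(events: Iterable[dict]) -> Iterator[dict]:
    """Yield the first occurrence of each key and drop later repeats."""
    seen = set()
    for event in events:
        key = dedup_key(event)
        if key in seen:
            continue
        seen.add(key)
        yield event


def make_event(event_id, user_id, ts):
    """Hypothetical helper that builds a 'like' event for the demo below."""
    return {
        "event_id": event_id,
        "user_id": user_id,
        "session_id": f"s{user_id}",
        "event_name": "like",
        "object_id": 10,
        "client_timestamp": ts,
    }


events = [
    make_event("e1", 1, 100),
    make_event("e1", 1, 100),  # client retry with the same event_id
    make_event(None, 2, 101),
    make_event(None, 2, 103),  # no event_id, same 5-second bucket as above
    make_event(None, 2, 140),  # later bucket, treated as a distinct event
]
unique = list(deduplicate(events))
```

One design caveat: fixed buckets can split a retry pair that straddles a bucket boundary, which is one reason a client-generated `event_id` is preferable to a composite key when it is available.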
Worked example
Design logging for a new Reels “Not Interested” feedback feature.
In the first 30 seconds, a strong candidate would clarify the decision: are we measuring adoption of the feature, using it as a ranking signal, evaluating user satisfaction, or training a negative-preference model? They would ask where the button appears, whether it hides the current reel immediately, whether users can undo, and whether the feature is in an A/B test. The answer should be organized around four pillars: event taxonomy, schema design, data quality validation, and downstream use for metrics/modeling. For event taxonomy, they would log impressions, menu opens, “Not Interested” taps, undo actions, hides, subsequent watch behavior, and ranking request IDs so the feedback can be joined to the candidate set served. For schema, they would include user, reel, creator, session, surface, timestamp, app version, experiment arm, request ID, and reason code if the UI asks for one.
A key tradeoff is whether to log only explicit taps or also infer negative preference from skips and short watch time. Explicit feedback is cleaner but sparse and biased toward highly motivated users; implicit signals are dense but noisier and confounded by network latency, accidental swipes, or session interruption. The candidate should flag validation checks: event volume by app version, tap-to-menu-open ratio, impossible sequences like undo without prior tap, duplicate event_ids, and delayed mobile events. They should also mention guardrails, such as monitoring whether the feature disproportionately suppresses new creators or specific content categories. A good close would be: “If I had more time, I’d design a gold-label audit to compare explicit feedback against human judgments and run a logging A/A test before trusting the metric in experiments.”
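Two of the validation checks named above, duplicate event_ids and an undo without a prior tap, can be sketched as follows. The event names `not_interested_tap` and `not_interested_undo`, the `reel_id` field, and the function `validate_feedback_events` are hypothetical names chosen for illustration. Ordering by `server_timestamp` is also an assumption, and late-arriving mobile events could produce false alarms.

```python
from collections import Counter


def validate_feedback_events(events):
    """Flag duplicate event_ids and undo events that have no preceding tap."""
    id_counts = Counter(e["event_id"] for e in events)
    duplicate_ids = sorted(eid for eid, n in id_counts.items() if n > 1)

    orphan_undos = []
    open_taps = set()  # (user_id, reel_id) pairs with a tap not yet undone
    for e in sorted(events, key=lambda ev: ev["server_timestamp"]):
        key = (e["user_id"], e["reel_id"])
        if e["event_name"] == "not_interested_tap":
            open_taps.add(key)
        elif e["event_name"] == "not_interested_undo":
            if key in open_taps:
                open_taps.remove(key)
            else:
                orphan_undos.append(e["event_id"])
    return {"duplicate_event_ids": duplicate_ids, "orphan_undos": orphan_undos}


events = [
    {"event_id": "a", "user_id": 1, "reel_id": 7,
     "event_name": "not_interested_tap", "server_timestamp": 10},
    {"event_id": "b", "user_id": 1, "reel_id": 7,
     "event_name": "not_interested_undo", "server_timestamp": 12},
    {"event_id": "c", "user_id": 2, "reel_id": 7,
     "event_name": "not_interested_undo", "server_timestamp": 15},  # no prior tap
    {"event_id": "a", "user_id": 1, "reel_id": 7,
     "event_name": "not_interested_tap", "server_timestamp": 10},   # retry
]
report = validate_feedback_events(events)
```

In practice a check like this would more likely run as a scheduled query that tracks the rate of each anomaly by app version, so that a logging regression in one release stands out.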
A second angle
Build labels for a classifier that detects policy-violating content.
The same fundamentals apply, but the center of gravity shifts from behavioral instrumentation to label quality and bias. A strong answer would define the policy taxonomy, sampling frame, annotator guidelines, adjudication process, and evaluation set before talking about modeling. The biggest issue is that observed labels may come from user reports or enforcement actions, which are highly selected and not representative of all content impressions. Instead of optimizing only on reported content, the candidate should propose stratified random sampling from impressions, oversampling rare suspected violations, and using inverse-probability weighting when estimating prevalence. They should also separate training data from a blinded, policy-reviewed gold set to prevent inflated offline metrics.
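To illustrate why inverse-probability weighting matters when rare violations are oversampled, here is a small sketch with made-up numbers. The stratum names, counts, and function names are all hypothetical. The estimator is the standard stratified (Horvitz-Thompson style) one, where each labeled item is weighted by the inverse of its sampling probability.

```python
def weighted_prevalence(strata):
    """Estimate prevalence by weighting each stratum by population / sampled."""
    weighted_violations = 0.0
    total_population = 0
    for s in strata:
        weight = s["population"] / s["sampled"]  # inverse sampling probability
        weighted_violations += s["violations"] * weight
        total_population += s["population"]
    return weighted_violations / total_population


def naive_prevalence(strata):
    """Pool the labeled samples and ignore how they were drawn (biased)."""
    total_violations = sum(s["violations"] for s in strata)
    total_sampled = sum(s["sampled"] for s in strata)
    return total_violations / total_sampled


# Made-up example: suspected violations are heavily oversampled.
strata = [
    {"name": "suspected", "population": 10_000, "sampled": 1_000, "violations": 200},
    {"name": "other", "population": 990_000, "sampled": 1_000, "violations": 2},
]

weighted = weighted_prevalence(strata)  # about 0.4% of impressions
naive = naive_prevalence(strata)        # about 10%, inflated by oversampling
```

The gap between the two numbers shows how much an unweighted readout would overstate prevalence under this sampling design.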
Common pitfalls
Analytical mistake: treating logs as ground truth.
A tempting answer is, “We’ll log clicks and calculate CTR,” without asking whether impressions were logged consistently, whether retries duplicate clicks, or whether logged impressions were actually visible. A stronger answer explains the full event lifecycle and validates consistency between client events, server delivery logs, and experiment assignment.
Communication mistake: jumping straight into fields.
Listing user_id, timestamp, and event_type is not enough if you have not tied the logging plan to a product decision. Interviewers want to see that you can say, “The decision is whether this feature improves long-term satisfaction, so I need both immediate interactions and downstream retention or hide/report behavior.”
Depth mistake: ignoring label bias.
For ML or integrity problems, candidates often say, “Use human-labeled data,” as if that solves truth. Better answers discuss annotator disagreement, policy drift, class imbalance, prevalence estimation, gold sets, active learning bias, and how labels generated from reports or enforcement systems can encode historical blind spots.
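To show why raw agreement can mislead on imbalanced labels, here is a minimal sketch of Cohen’s κ for two annotators, computed as $(p_o - p_e)/(1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is agreement expected by chance. The label values and data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both used one identical label
    return (observed - expected) / (1 - expected)


# Invented data: 20 items, mostly "ok", with two disagreements.
annotator_a = ["ok"] * 17 + ["ok", "violating", "violating"]
annotator_b = ["ok"] * 17 + ["violating", "ok", "violating"]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / 20
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here raw agreement is 90%, but κ is roughly 0.44 because most of that agreement is expected by chance when nearly every item is labeled "ok".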
Connections
Interviewers may pivot from instrumentation into experimentation, especially whether missing or delayed logs bias A/B test readouts. They may also move into metric design, causal inference, ranking-system evaluation, privacy-preserving analytics, or ML evaluation under noisy labels. If the discussion touches label bias, expect follow-ups on sampling, calibration, precision/recall tradeoffs, and counterfactual evaluation.
Further reading
- Designing Data-Intensive Applications — Excellent grounding in logs, streams, batch processing, consistency, and failure modes in large-scale data systems.
- Hidden Technical Debt in Machine Learning Systems — Seminal paper on why data dependencies, monitoring, and feedback loops matter in production ML.
- Google Rules of Machine Learning — Practical guidance on instrumentation, labels, launch criteria, and monitoring for ML systems.