Instrumentation, Logging, And Data Quality
Asked of: Data Scientist
What's being tested
Interviewers are testing whether you can reason from product behavior to reliable analytical data: what events should be logged, how they should be defined, how they can fail, and how you would validate them before making product decisions. For a Meta Data Scientist, this matters because metrics like impressions, clicks, sessions, shares, watch time, and notifications often come from complex client/server instrumentation rather than clean transactional tables. The interviewer is probing whether you understand that “data exists” is not the same as “data is decision-grade.” Strong answers combine product intuition, logging design, data engineering awareness, and statistical sanity checks.
Core knowledge
- Instrumentation starts with a precise event taxonomy: event name, actor, object, surface, timestamp, context, and trigger condition. For example, `feed_impression` should specify whether it fires when content is rendered, visible for 1 pixel, visible for 50% of area, or visible for 1+ seconds.
- Separate client-side and server-side logging tradeoffs. Client logs capture actual user experience, viewport visibility, scroll depth, and latency, but suffer from dropped events, offline buffering, ad blockers, and app-version skew. Server logs are more reliable but may count delivered content that the user never actually saw.
- Every event needs a stable identifier strategy. Common IDs include `user_id`, `device_id`, `session_id`, `request_id`, `surface_id`, `content_id`, `experiment_id`, and `event_id`. `event_id` enables deduplication; `request_id` lets you join ranking, delivery, impression, and engagement logs.
- Idempotency and deduplication are essential. Mobile clients may retry uploads, causing duplicate events. A common pattern is to generate a UUID client-side and dedupe on `(event_id)`, or on `(user_id, event_type, object_id, timestamp_bucket)` when event IDs are unavailable, accepting some false merges.
- Event-time and processing-time are different. Event-time is when the user action occurred; processing-time is when the pipeline ingested it. Late-arriving events affect dashboards, experiment metrics, and retention calculations. Pipelines often use watermarking, e.g., “finalize daily aggregates after 24–48 hours.”
- Data quality checks should cover volume, schema, freshness, null rates, cardinality, joins, distributions, and business invariants. Examples: clicks should not exceed impressions on the same surface by more than a small margin; required fields should not have nulls; DAU should not move 10% without a launch or outage.
- Logging changes need versioning. Add fields in a backward-compatible way, maintain `client_version` and `schema_version`, and avoid silently redefining metrics. If “impression” changes from render-based to viewability-based, create a new event or metric version rather than overwriting history.
- Sampling must be explicit and analyzable. If logging 100% of impressions is too expensive, sample with a known probability $p$ and estimate counts using $\hat{N} = n / p$, where $n$ is the number of sampled events. Sampling should be deterministic by user or request when consistency across joined events matters.
- Approximate counting is common at Meta scale. HyperLogLog estimates distinct users with relative error about $1.04 / \sqrt{m}$, where $m$ is the number of registers. Use exact counts for small tables or critical billing/experiment metrics; use HLL-like sketches for fast, high-cardinality observability.
- Validate new instrumentation against independent sources. Compare client impressions to server deliveries, click-through rate to historical baselines, app-version slices, country slices, platform slices, and experiment holdouts. A metric that only breaks on Android 327.1 or low-connectivity regions is easy to miss in aggregate.
- Beware survivorship and missingness bias. If crashy sessions fail to upload logs, observed engagement may be biased upward. If low-end devices drop video-watch events, product changes may look better among logged users than among all users. Missingness is rarely random in consumer apps.
- For experimentation, logging must be assignment-aware. Store experiment unit, treatment, assignment timestamp, eligibility, and exposure. Distinguish “assigned to treatment” from “actually exposed to feature”; intention-to-treat metrics are usually safer for launch decisions, while treatment-on-treated analyses require stronger assumptions.
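The sampling point in the list above can be made concrete with a small sketch. It hashes a unit ID (here a request ID) into a bucket in [0, 1) so that the same unit is always in or out of the sample, then scales sampled counts by the inverse of the sampling probability. The names `in_sample`, `estimate_total`, the salt `impr_v1`, and the 10% rate are illustrative assumptions, not part of any real logging system.

```python
import hashlib

SAMPLE_RATE = 0.10  # assumed sampling probability p


def in_sample(unit_id: str, rate: float = SAMPLE_RATE, salt: str = "impr_v1") -> bool:
    """Deterministically decide whether a unit (user or request) is logged."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a roughly uniform value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def estimate_total(sampled_count: int, rate: float = SAMPLE_RATE) -> float:
    """Scale a sampled count back up by the inverse sampling probability."""
    return sampled_count / rate


# Hypothetical request IDs; every event carrying the same request_id
# gets the same sampling decision, so joins across logs stay consistent.
request_ids = [f"req_{i}" for i in range(100_000)]
sampled = [r for r in request_ids if in_sample(r)]
print(len(sampled), estimate_total(len(sampled)))
```

Hashing on the join key rather than drawing a random number per event is what keeps delivery, impression, and click logs for one request either all present or all absent.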
Worked example
For “Design logging for News Feed impressions and clicks”, a strong candidate would first clarify the product surface and metric purpose: “Are we using this for ranking evaluation, experimentation, ads billing, or product health? Do we need exact counts or directional analytics? Should an impression mean delivered, rendered, or viewable?” Then they would declare a reasonable assumption, such as: “I’ll define an organic Feed impression as a story being at least 50% visible in the viewport for at least one second.” The answer can be organized around four pillars: event definitions, identifiers and joins, pipeline reliability, and validation.
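To show how precise such a definition needs to be, here is a minimal sketch of the assumed rule "at least 50% visible for at least one second." It assumes the client emits periodic visibility samples as `(timestamp_seconds, visible_fraction)` pairs; that sampling scheme and the function name `is_viewable_impression` are hypothetical.

```python
def is_viewable_impression(samples, min_fraction=0.5, min_seconds=1.0):
    """Return True if the story stayed at or above min_fraction visible
    for a continuous run of at least min_seconds.

    samples: list of (timestamp_seconds, visible_fraction), sorted by time.
    """
    run_start = None
    for ts, fraction in samples:
        if fraction >= min_fraction:
            if run_start is None:
                run_start = ts  # a new qualifying run begins
            if ts - run_start >= min_seconds:
                return True
        else:
            run_start = None  # visibility dropped, so the run resets
    return False


# Visible enough from t=0.2 through t=1.3: qualifies.
print(is_viewable_impression([(0.0, 0.2), (0.2, 0.6), (0.7, 0.8), (1.3, 0.9)]))
# Interrupted at t=0.5, then only 0.7s of visibility: does not qualify.
print(is_viewable_impression([(0.0, 0.6), (0.5, 0.3), (0.8, 0.7), (1.5, 0.7)]))
```

Even this small rule forces decisions worth stating aloud, such as whether visibility must be continuous or cumulative.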
For event definitions, they would propose feed_story_delivered, feed_story_impression, and feed_story_click as separate events because delivery, visibility, and engagement answer different questions. For identifiers, they would include user_id, session_id, request_id, story_id, ranking_position, surface, client_timestamp, server_received_timestamp, app_version, and experiment_ids. For reliability, they would discuss client buffering, retry behavior, idempotent event_ids, offline upload delays, and deduplication. For validation, they would compare impression volume to server delivery volume, monitor CTR by platform and app version, and assert that clicks without prior impressions remain below a small threshold.
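The reliability and validation pillars can be sketched together. The example below deduplicates on `event_id` and then computes the share of clicks with no matching impression for the same `(request_id, story_id)`. The `Event` class and its field subset are assumptions for illustration, and the check ignores timestamp ordering for simplicity, since client clocks are not always trustworthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    event_name: str  # e.g. "feed_story_impression" or "feed_story_click"
    request_id: str
    story_id: str
    client_timestamp: float


def dedupe(events):
    """Keep the first occurrence of each event_id (retries reuse the same ID)."""
    seen = set()
    unique = []
    for e in events:
        if e.event_id not in seen:
            seen.add(e.event_id)
            unique.append(e)
    return unique


def orphan_click_rate(events):
    """Share of clicks with no impression for the same (request_id, story_id)."""
    impressions = {
        (e.request_id, e.story_id)
        for e in events
        if e.event_name == "feed_story_impression"
    }
    clicks = [e for e in events if e.event_name == "feed_story_click"]
    if not clicks:
        return 0.0
    orphans = [c for c in clicks if (c.request_id, c.story_id) not in impressions]
    return len(orphans) / len(clicks)


events = [
    Event("e1", "feed_story_impression", "r1", "s1", 1.0),
    Event("e1", "feed_story_impression", "r1", "s1", 1.0),  # retried upload
    Event("e2", "feed_story_click", "r1", "s1", 2.0),
    Event("e3", "feed_story_click", "r1", "s2", 3.0),  # no impression logged
]
clean = dedupe(events)
print(len(clean), orphan_click_rate(clean))
```

In practice the orphan rate would be tracked over time and alerted on when it exceeds an agreed threshold, rather than asserted to be exactly zero.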
A key tradeoff is client-side viewability versus server-side delivery logging. Client-side logging better reflects what the user saw, but it is more vulnerable to dropped logs and inconsistent app implementations; server-side logging is more complete but can overcount unseen content. A strong close would be: “If I had more time, I’d define a rollout plan with shadow logging, compare old and new impression definitions for several weeks, and build alerts for sudden shifts by app version, country, and surface.”
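One way to monitor this tradeoff is to track the ratio of client impressions to server deliveries per slice and flag slices that drift from a baseline. The sketch below assumes each log row has already been reduced to a `(platform, app_version)` key; the counts, the 0.6 baseline, and the 0.15 tolerance are made-up values for illustration.

```python
from collections import Counter


def impression_to_delivery_ratio(deliveries, impressions):
    """Per-slice ratio of client impressions to server deliveries."""
    delivered = Counter(deliveries)
    seen = Counter(impressions)
    return {s: seen[s] / n for s, n in delivered.items() if n > 0}


def flag_slices(ratios, baseline, tolerance=0.15):
    """Return slices whose ratio deviates from the baseline by more than tolerance."""
    return sorted(s for s, r in ratios.items() if abs(r - baseline) > tolerance)


# Hypothetical counts: one Android version logs far fewer impressions per delivery.
deliveries = (
    [("android", "327.1")] * 100
    + [("android", "326.0")] * 100
    + [("ios", "326.0")] * 100
)
impressions = (
    [("android", "327.1")] * 30
    + [("android", "326.0")] * 62
    + [("ios", "326.0")] * 60
)

ratios = impression_to_delivery_ratio(deliveries, impressions)
print(flag_slices(ratios, baseline=0.6))
```

The aggregate ratio here is about 0.51, which might pass a loose global check; the per-slice view is what isolates the broken version.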
A second angle
For “Investigate a sudden drop in Daily Active Users”, the same skills apply, but the framing shifts from designing logs to debugging metric trustworthiness. A strong candidate would avoid immediately assuming a real product decline and would first split the problem into instrumentation, pipeline, and true user behavior. They might check whether the drop aligns with a logging deploy, app release, timezone boundary, ingestion delay, or schema change. Then they would compare DAU against independent signals such as server requests, push notification opens, login events, session starts, and app store crash reports. The constraint is speed: in an incident-style question, the interviewer wants a prioritized debugging tree, not a perfect event schema.
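The first branch of that debugging tree can be sketched as a comparison between the DAU change and the change in independent signals. The function names, the 5% threshold, and the example numbers are assumptions; a real investigation would also slice by platform, app version, and country.

```python
def pct_change(today, baseline):
    """Relative change of today's value against a baseline value."""
    return (today - baseline) / baseline


def triage(dau, signals, drop_threshold=-0.05):
    """Suggest where to look first.

    dau: (today, baseline) for the DAU metric.
    signals: dict of name -> (today, baseline) for independent signals.
    """
    if pct_change(*dau) > drop_threshold:
        return "no material drop"
    changes = {name: pct_change(t, b) for name, (t, b) in signals.items()}
    dropped = [n for n, c in changes.items() if c <= drop_threshold]
    if not dropped:
        return "suspect logging or pipeline: independent signals are stable"
    if len(dropped) == len(changes):
        return "suspect real behavior change or outage: all signals dropped"
    return "mixed: slice further; dropped signals: " + ", ".join(sorted(dropped))


# Hypothetical numbers: DAU is down 10% while server-side signals barely moved.
print(triage((90, 100), {"server_requests": (995, 1000), "session_starts": (498, 500)}))
```

The output is only a prioritization hint: stable independent signals point toward instrumentation or ingestion, while a uniform drop points toward real user behavior or an outage.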
Common pitfalls
An analytical mistake is treating logged events as ground truth without asking how they are generated. For example, saying “DAU dropped, so fewer users came to the app” skips obvious alternatives like late ingestion, authentication logging failures, or a broken session-start event. A better answer triangulates across multiple sources before interpreting product impact.
A communication mistake is jumping into implementation details before defining the metric. “I’d log user ID, timestamp, and click ID to Kafka” is incomplete if you have not specified what counts as an impression, click, session, or active user. Start with the decision the metric supports, then define the event semantics, then discuss systems.
A depth mistake is ignoring edge cases that dominate real consumer data. Offline mobile usage, retries, duplicate uploads, app-version skew, bot traffic, timezones, and privacy constraints can materially change metric interpretation. Mentioning two or three of these concretely signals that you have worked with messy production data rather than textbook tables.
Connections
Interviewers may pivot from instrumentation into experimentation, especially how exposure logging affects intention-to-treat versus treatment-on-treated analysis. They may also move into metric design, anomaly detection, causal inference, privacy-preserving analytics, or debugging product metric regressions. If the conversation turns technical, expect questions on joins, deduplication, late-arriving data, sampling, and approximate distinct counting.
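For the approximate distinct counting question, it helps to have one number ready. Using the standard HyperLogLog error approximation from Flajolet et al., with $m$ registers, and taking $m = 2^{14}$ as an illustrative assumption:

$$
\text{relative error} \approx \frac{1.04}{\sqrt{m}} = \frac{1.04}{\sqrt{16384}} = \frac{1.04}{128} \approx 0.0081
$$

That is roughly 0.8% error on a distinct-user count, which is usually fine for observability dashboards and usually not fine for billing.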
Further reading
- The Data Warehouse Toolkit — Ralph Kimball and Margy Ross — Practical foundation for fact tables, dimensions, grain, and event modeling.
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm — Flajolet et al. — Seminal paper behind approximate distinct counting at large scale.
- Designing Data-Intensive Applications — Martin Kleppmann — Strong background on event logs, stream processing, consistency, batch pipelines, and data systems tradeoffs.