Design a daily data pipeline for analytics over conversation logs.
Every day, a JSONL file lands in object storage. The file is associated with a load_date. Each line is one conversation record with top-level fields such as conversation_id, user_id, started_at, model_version, and a messages array. Each message contains fields such as message_id, role, content, timestamp, tokens_in, tokens_out, and latency_ms.
The expected deliverable is pseudo-code rather than production code. Cover the following:
-
Define warehouse schemas or typed column lists for the target tables.
-
Parse and validate each JSONL record.
-
Handle malformed JSON lines, missing fields, unexpected data types, impossible numeric values, and partially invalid records.
-
Deduplicate records within a daily load and across the entire warehouse table.
-
Schedule the daily job and make it idempotent.
-
Explain how you would handle conversations whose messages span more than one calendar day.
-
Explain whether missing numeric values should be defaulted to
0
, stored as
NULL
, or deleted.
-
If the pipeline runs on Spark and one worker has data skew, explain how you would diagnose and mitigate it.
Assume there is a 45-minute live walkthrough with a hiring manager, so your design should be clear enough to explain tradeoffs.