How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at Cohere.

What role is this question designed for?

This question is commonly asked for Data Engineer candidates at Cohere during technical interviews.

Design a Conversation Log Ingestion Pipeline

Quick Overview

This question evaluates a data engineer's competency in designing robust daily ingestion and ETL pipelines for conversation logs, covering schema and warehouse design, JSON parsing and validation, deduplication and idempotency, error handling, and operational diagnostics such as Spark data skew.

Design a daily data pipeline for analytics over conversation logs.

Every day, a JSONL file lands in object storage. The file is associated with a load_date. Each line is one conversation record with top-level fields such as conversation_id, user_id, started_at, model_version, and a messages array. Each message contains fields such as message_id, role, content, timestamp, tokens_in, tokens_out, and latency_ms.

The expected deliverable is pseudo-code rather than production code. Cover the following:

Define warehouse schemas or typed column lists for the target tables.
Parse and validate each JSONL record.
Handle malformed JSON lines, missing fields, unexpected data types, impossible numeric values, and partially invalid records.
Deduplicate records within a daily load and across the entire warehouse table.
Schedule the daily job and make it idempotent.
Explain how you would handle conversations whose messages span more than one calendar day.
Explain whether missing numeric values should be defaulted to 0 , stored as NULL , or deleted.
If the pipeline runs on Spark and one worker has data skew, explain how you would diagnose and mitigate it.

Assume there is a 45-minute live walkthrough with a hiring manager, so your design should be clear enough to explain tradeoffs.

Quick Overview

Design a daily data pipeline for analytics over conversation logs.

The expected deliverable is pseudo-code rather than production code. Cover the following:

Define warehouse schemas or typed column lists for the target tables.
Parse and validate each JSONL record.
Handle malformed JSON lines, missing fields, unexpected data types, impossible numeric values, and partially invalid records.
Deduplicate records within a daily load and across the entire warehouse table.
Schedule the daily job and make it idempotent.
Explain how you would handle conversations whose messages span more than one calendar day.
Explain whether missing numeric values should be defaulted to 0 , stored as NULL , or deleted.
If the pipeline runs on Spark and one worker has data skew, explain how you would diagnose and mitigate it.

Assume there is a 45-minute live walkthrough with a hiring manager, so your design should be clear enough to explain tradeoffs.

Design a Conversation Log Ingestion Pipeline

Quick Overview

Solution

Submit Your Answer to Earn 20XP

Design a Conversation Log Ingestion Pipeline

Quick Overview

Solution

Submit Your Answer to Earn 20XP