PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/Cohere

Design a Conversation Log Ingestion Pipeline

Last updated: Jun 13, 2026

Quick Overview

This question evaluates a data engineer's competency in designing robust daily ingestion and ETL pipelines for conversation logs, covering schema and warehouse design, JSON parsing and validation, deduplication and idempotency, error handling, and operational diagnostics such as Spark data skew.

  • medium
  • Cohere
  • System Design
  • Data Engineer

Design a Conversation Log Ingestion Pipeline

Company: Cohere

Role: Data Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design a daily data pipeline for analytics over conversation logs. Every day, a JSONL file lands in object storage. The file is associated with a `load_date`. Each line is one conversation record with top-level fields such as `conversation_id`, `user_id`, `started_at`, `model_version`, and a `messages` array. Each message contains fields such as `message_id`, `role`, `content`, `timestamp`, `tokens_in`, `tokens_out`, and `latency_ms`. The expected deliverable is pseudo-code rather than production code. Cover the following: 1. Define warehouse schemas or typed column lists for the target tables. 2. Parse and validate each JSONL record. 3. Handle malformed JSON lines, missing fields, unexpected data types, impossible numeric values, and partially invalid records. 4. Deduplicate records within a daily load and across the entire warehouse table. 5. Schedule the daily job and make it idempotent. 6. Explain how you would handle conversations whose messages span more than one calendar day. 7. Explain whether missing numeric values should be defaulted to `0`, stored as `NULL`, or deleted. 8. If the pipeline runs on Spark and one worker has data skew, explain how you would diagnose and mitigate it. Assume there is a 45-minute live walkthrough with a hiring manager, so your design should be clear enough to explain tradeoffs.

Quick Answer: This question evaluates a data engineer's competency in designing robust daily ingestion and ETL pipelines for conversation logs, covering schema and warehouse design, JSON parsing and validation, deduplication and idempotency, error handling, and operational diagnostics such as Spark data skew.

Cohere logo
Cohere
May 30, 2026, 12:00 AM
Data Engineer
Technical Screen
System Design
1
0

Design a daily data pipeline for analytics over conversation logs.

Every day, a JSONL file lands in object storage. The file is associated with a load_date. Each line is one conversation record with top-level fields such as conversation_id, user_id, started_at, model_version, and a messages array. Each message contains fields such as message_id, role, content, timestamp, tokens_in, tokens_out, and latency_ms.

The expected deliverable is pseudo-code rather than production code. Cover the following:

  1. Define warehouse schemas or typed column lists for the target tables.
  2. Parse and validate each JSONL record.
  3. Handle malformed JSON lines, missing fields, unexpected data types, impossible numeric values, and partially invalid records.
  4. Deduplicate records within a daily load and across the entire warehouse table.
  5. Schedule the daily job and make it idempotent.
  6. Explain how you would handle conversations whose messages span more than one calendar day.
  7. Explain whether missing numeric values should be defaulted to 0 , stored as NULL , or deleted.
  8. If the pipeline runs on Spark and one worker has data skew, explain how you would diagnose and mitigate it.

Assume there is a 45-minute live walkthrough with a hiring manager, so your design should be clear enough to explain tradeoffs.

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Cohere•More Data Engineer•Cohere Data Engineer•Cohere System Design•Data Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.