Design a 25,000-CSV ETL pipeline
Company: Natoora
Role: Data Analyst
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
You state that you built an ETL pipeline to preprocess 25,000 CSV files and load the results into a centralized database.
Design the pipeline in a way that makes the term "automated ETL" precise.
Your answer should explain:
- how the files are received in practice, for example manual upload, shared drive export, SFTP drop, cloud object storage, or API-generated extracts;
- what triggers extraction and transformation, for example a cron job, polling workflow, event-driven upload notification, or message queue (a minimal polling sketch follows this list);
- how you distinguish manual, semi-automated, and fully automated versions of the same workflow;
- how schema validation, type standardization, deduplication, bad-row handling, retries, and idempotency are implemented (see the transform sketch after this list);
- where the centralized database lives, for example a local server, an on-premise database, a vendor-managed system, or a cloud warehouse;
- whether the pipeline is truly productionized and what operational evidence would support that claim.
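
To make the trigger bullet concrete, here is a minimal polling sketch in Python. The `landing` drop folder, `processed.json` manifest, and `process_file` hook are all hypothetical; an event-driven variant would replace the loop with an upload notification from object storage feeding a queue.

```python
import json
import time
from pathlib import Path

LANDING_DIR = Path("landing")      # hypothetical SFTP/shared-drive drop folder
MANIFEST = Path("processed.json")  # names of files already ingested

def load_manifest() -> set[str]:
    return set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()

def process_file(path: Path) -> None:
    print(f"ingesting {path.name}")  # placeholder for the transform-and-load step

def poll_once() -> None:
    done = load_manifest()
    for path in sorted(LANDING_DIR.glob("*.csv")):
        if path.name not in done:   # skip files already ingested: reruns are safe
            process_file(path)
            done.add(path.name)
    MANIFEST.write_text(json.dumps(sorted(done)))

if __name__ == "__main__":
    while True:          # a cron entry or queue consumer replaces this loop in practice
        poll_once()
        time.sleep(60)   # check the drop folder once a minute
```

The manifest is what turns a script someone reruns by hand into a semi-automated workflow: reruns skip files already ingested, so the poll is safe to schedule under cron. Fully automated would additionally mean no human moves files into `landing` at all.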
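For the transformation bullet, here is one plausible shape of the validate-standardize-deduplicate step, sketched with pandas; the column names and canonical schema are invented for illustration.

```python
import pandas as pd

# Hypothetical canonical schema for the target analytics table.
SCHEMA = ["order_id", "sku", "quantity", "unit_price", "delivered_at"]

def transform(path: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (clean_rows, quarantined_rows) for one CSV file."""
    df = pd.read_csv(path, dtype="string")

    # Schema validation: normalize header spelling, then reject the
    # whole file if required columns are still missing.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    missing = set(SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"{path}: missing columns {sorted(missing)}")
    df = df[SCHEMA].copy()

    # Type standardization: coerce, so unparseable values become NaN/NaT
    # instead of crashing the run.
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce").astype("Int64")
    df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
    df["delivered_at"] = pd.to_datetime(df["delivered_at"], errors="coerce")

    # Bad-row handling: quarantine rows that failed coercion rather than
    # dropping them silently.
    bad = df[df[["order_id", "quantity", "delivered_at"]].isna().any(axis=1)]
    good = df.drop(bad.index)

    # Deduplication within the file on the natural key.
    good = good.drop_duplicates(subset=["order_id", "sku"], keep="last")
    return good, bad
```

Quarantining bad rows instead of dropping them silently also feeds the last bullet: a rejected-row count per file is concrete operational evidence you can monitor and report.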
Assume the files may have inconsistent schemas, duplicate records, and occasional corruption, and may arrive late. The target is a single analytics-ready table used by downstream analysts and dashboards.
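
Finally, a sketch of an idempotent load into the single target table, using SQLite's upsert so the example stays self-contained; the `orders` table, the `warehouse.db` path, and the (order_id, sku) natural key carry over from the hypothetical transform sketch above, and a cloud warehouse would typically use a staging table plus MERGE instead.

```python
import sqlite3
import pandas as pd

def load(good: pd.DataFrame, db_path: str = "warehouse.db") -> None:
    """Idempotent load: replaying the same file leaves the table unchanged."""
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id     TEXT,
            sku          TEXT,
            quantity     INTEGER,
            unit_price   REAL,
            delivered_at TEXT,
            PRIMARY KEY (order_id, sku)
        )""")
    # Cast to plain Python types so the sqlite3 driver can bind them.
    rows = [
        (str(r.order_id), str(r.sku), int(r.quantity),
         float(r.unit_price), r.delivered_at.isoformat())
        for r in good.itertuples(index=False)
    ]
    # Upsert on the natural key: re-inserting a row updates it in place.
    con.executemany("""
        INSERT INTO orders VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (order_id, sku) DO UPDATE SET
            quantity     = excluded.quantity,
            unit_price   = excluded.unit_price,
            delivered_at = excluded.delivered_at
        """, rows)
    con.commit()
    con.close()
```

Because the load is keyed on (order_id, sku), replaying a file after a retry or a late arrival overwrites rather than duplicates, which is the practical meaning of idempotency here.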
Quick Answer: This question evaluates data engineering and ETL pipeline design skills: automation, schema validation, type standardization, deduplication, bad-row handling, retries, idempotency, and the operationalization of large-scale CSV ingestion for analytics.