You are building a backend service that needs to process two CSV files and then call an external GPT-like API for classification.
Requirements
- HTTP Endpoint
  - Expose an HTTP endpoint, e.g. `POST /ingest-data`.
  - The client uploads two CSV files in a single request:
    - A typical row in `users.csv` might be: `user_id,name,email`.
    - A typical row in `tasks.csv` might be: `task_id,user_id,description`.
- CSV Parsing and Local JSON Storage
  - The endpoint should:
    - Receive the two CSV files.
    - Parse them into in-memory data structures (e.g., lists of objects).
    - Serialize each dataset into JSON.
    - Persist the resulting JSON to the local filesystem (e.g., `users.json`, `tasks.json`).
- GPT Classification Step
  - After parsing, the service should call an external GPT-like API to classify one field in the JSON data. For example:
    - For each task in `tasks.json`, classify the `description` into one of a small set of categories (e.g., `"bug"`, `"feature"`, `"documentation"`).
  - The GPT API:
    - Is accessed via HTTPS.
    - Takes a text prompt and returns a classification label in JSON.
  - You are free to design the prompt and to decide whether to call the GPT API per-record or in batches, as long as all tasks end up with a classification label.
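One possible shape for the per-record variant is shown below. The endpoint URL, the `{"label": ...}` response format, and the backoff/fallback policy are all assumptions for illustration; injecting the transport as `post_fn` keeps the classification logic testable without real network calls:

```python
import json
import time
import urllib.request

ALLOWED_LABELS = {"bug", "feature", "documentation"}

def default_post(prompt: str) -> str:
    """POST the prompt to a hypothetical GPT-like HTTPS API.

    The URL and the {"label": "..."} response shape are assumptions,
    not a real API contract.
    """
    req = urllib.request.Request(
        "https://gpt.example.com/v1/classify",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["label"]

def classify_tasks(tasks: list[dict], post_fn=default_post, retries: int = 3) -> list[dict]:
    """Return copies of the tasks with a 'category' field attached."""
    out = []
    for task in tasks:
        prompt = (
            "Classify this task description as exactly one of "
            "bug, feature, or documentation:\n" + task["description"]
        )
        label = "unknown"
        for attempt in range(retries):
            try:
                label = post_fn(prompt)
                break
            except Exception:
                time.sleep(2 ** attempt)  # simple exponential backoff
        if label not in ALLOWED_LABELS:
            label = "unknown"  # guard against malformed model output
        out.append({**task, "category": label})
    return out
```

Batching (sending several descriptions per prompt) would reduce API calls at the cost of a more fragile response-parsing step; either choice satisfies the requirement as long as every task gets a label.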
- Response
  - After classification, return an HTTP response that includes at least:
    - A success indicator.
    - Basic stats (e.g., number of users, number of tasks processed).
    - Optionally, the enriched `tasks` data with the new classification field.
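A minimal sketch of that response body, with illustrative field names (the requirements fix only the content, not the names):

```python
def build_response(users: list[dict], tasks: list[dict], include_tasks: bool = False) -> dict:
    """Assemble the JSON body returned by the endpoint."""
    body = {
        "success": True,  # success indicator
        "stats": {"users": len(users), "tasks": len(tasks)},  # basic stats
    }
    if include_tasks:
        body["tasks"] = tasks  # optionally include the enriched tasks
    return body
```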
- Non-functional Requirements
  - Handle basic validation and error cases (missing file, malformed CSV, GPT API failure).
  - Assume multiple clients may call this endpoint concurrently.
  - The solution should be reasonably testable.
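Because concurrent clients may write `users.json` and `tasks.json` at the same time, one common mitigation (a sketch, not the only option — per-request filenames or a lock would also work) is to write to a temporary file and atomically rename it into place:

```python
import json
import os
import tempfile

def atomic_write_json(records, path: str) -> None:
    """Write JSON via temp file + os.replace so a concurrent reader
    never observes a half-written file (os.replace atomically replaces
    the destination on both POSIX and Windows)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(records, f)
        os.replace(tmp, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the temp file on failure
        raise
```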
Task
Describe how you would design and implement this endpoint, including:
- The HTTP API contract (request format, response format).
- How you handle file uploads and CSV parsing.
- How you structure the code to write JSON to local storage.
- How you integrate with the GPT classification API (including error handling and possible batching).
- Considerations for concurrency, timeouts, and testing.
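On the testing point, the key design move is dependency injection: if the GPT transport is a parameter, tests can substitute a stub and exercise both success and failure paths offline. A self-contained sketch of that pattern (the `classify` stand-in is illustrative), runnable with `python -m unittest`:

```python
import unittest

def classify(description: str, post_fn) -> str:
    """Minimal stand-in for the GPT call, so the test pattern is self-contained."""
    return post_fn("Classify: " + description)

class ClassifyTests(unittest.TestCase):
    def test_uses_injected_client(self):
        # Stubbing the transport avoids any real HTTPS traffic.
        self.assertEqual(classify("Fix crash", lambda prompt: "bug"), "bug")

    def test_api_failure_surfaces(self):
        def failing(prompt):
            raise ConnectionError("GPT API down")
        with self.assertRaises(ConnectionError):
            classify("Fix crash", failing)
```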