Implement a CSV dataset join
Company: Stripe
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Take-home Project
Quick Answer: This question evaluates a candidate's ability to implement CSV-style dataset joins, exercising skills in string parsing, sorting, join semantics (left join behavior and multiple matches), and handling edge cases in data processing.
Part 1: Basic CSV inner join on a field
Constraints
- `customerFile` and `processorFile` each contain at least one row, which is the header.
- `fieldName` exists in both headers.
- All rows in a file have the same number of comma-separated columns as that file's header.
- Values do not contain quoted commas or escaped commas.
- For this part, each processor join key appears at most once.
Examples
Input: ('id', ['id,name', '2,Bob', '1,Alice'], ['id,city', '1,New York', '2,Los Angeles'])
Expected Output: ['id,name,city', '1,Alice,New York', '2,Bob,Los Angeles']
Explanation: Both files are sorted by `id` before joining, so Alice appears before Bob in the output.
Input: ('id', ['id,name', '3,Cara', '1,Alice', '2,Bob'], ['id,city', '2,LA', '1,NY'])
Expected Output: ['id,name,city', '1,Alice,NY', '2,Bob,LA']
Explanation: Cara has no processor match, so she is omitted in this inner-join version.
Input: ('id', ['id,name'], ['id,city', '1,NY'])
Expected Output: ['id,name,city']
Explanation: Edge case: there are no customer data rows, so only the output header is returned.
Input: ('email', ['name,email', 'Bob,b@example.com', 'Ann,a@example.com'], ['email,score', 'a@example.com,90'])
Expected Output: ['name,email,score', 'Ann,a@example.com,90']
Explanation: The join key does not have to be the first column.
Hints
- Parse the headers first and store the index of the join field in each file.
- After sorting the data rows, a dictionary from processor key to processor row makes matching efficient.
Part 2: Left join with empty processor columns
Constraints
- `customerFile` and `processorFile` each contain at least a header row.
- `fieldName` exists in both headers.
- All rows in a file have the same number of columns as that file's header.
- Values do not contain quoted commas or escaped commas.
- For this part, each processor join key appears at most once.
Examples
Input: ('id', ['id,name', '2,Bob', '1,Alice', '3,Cara'], ['id,city', '1,NY', '2,LA'])
Expected Output: ['id,name,city', '1,Alice,NY', '2,Bob,LA', '3,Cara,']
Explanation: Cara has no match, so her processor-side city value is empty.
Input: ('id', ['id,name', '2,Bob'], ['id,city'])
Expected Output: ['id,name,city', '2,Bob,']
Explanation: Edge case: the processor file has only a header, so every customer row gets empty processor columns.
Input: ('id', ['id,name'], ['id,city', '1,NY'])
Expected Output: ['id,name,city']
Explanation: Edge case: there are no customer data rows.
Input: ('email', ['name,email', 'Bob,b@x', 'Ann,a@x'], ['email,city,zip', 'a@x,Boston,02101'])
Expected Output: ['name,email,city,zip', 'Ann,a@x,Boston,02101', 'Bob,b@x,,']
Explanation: When there are two processor-side columns and no match, both missing values become empty strings.
Hints
- Build the output header by appending only the non-key processor columns.
- When there is no processor match, append the correct number of empty strings before joining the row back into CSV format.
Part 3: Left join with multiple processor matches
Constraints
- `customerFile` and `processorFile` each contain at least a header row.
- `fieldName` exists in both headers.
- All rows in a file have the same number of columns as that file's header.
- Values do not contain quoted commas or escaped commas.
- Processor keys may repeat, and every matching processor row should generate an output row.
Examples
Input: ('id', ['id,name', '2,Bob', '1,Alice'], ['id,city', '1,SF', '1,NY', '2,LA'])
Expected Output: ['id,name,city', '1,Alice,SF', '1,Alice,NY', '2,Bob,LA']
Explanation: Alice matches two processor rows, so she produces two joined output rows.
Input: ('id', ['id,name', '3,Cara', '2,Bob'], ['id,city', '2,LA', '2,SD'])
Expected Output: ['id,name,city', '2,Bob,LA', '2,Bob,SD', '3,Cara,']
Explanation: Bob has two matches, while Cara is unmatched and is kept with an empty processor value.
Input: ('id', ['id,name', '1,Alice', '1,Ava'], ['id,city', '1,NY', '1,SF'])
Expected Output: ['id,name,city', '1,Alice,NY', '1,Alice,SF', '1,Ava,NY', '1,Ava,SF']
Explanation: Each customer row with the same key joins against every processor row with that key.
Input: ('id', ['id,name', '1,Alice'], ['id,city'])
Expected Output: ['id,name,city', '1,Alice,']
Explanation: Edge case: no processor data rows exist, so the customer row remains with empty processor columns.
Hints
- Instead of mapping each processor key to one row, map it to a list of rows.
- Handle unmatched customer rows separately by appending empty strings for all non-key processor columns.
Part 4: Full dataset join with optional skipping of unmatched rows
Constraints
- `customerFile` and `processorFile` each contain at least a header row.
- `fieldName` exists in both headers.
- All rows in a file have the same number of columns as that file's header.
- Values do not contain quoted commas or escaped commas.
- Processor keys may repeat, and every matching processor row must be included.
Examples
Input: ('id', ['id,name', '3,Cara', '1,Alice', '2,Bob'], ['id,city,status', '1,NY,ok', '1,Boston,pending', '2,LA,ok'], False)
Expected Output: ['id,name,city,status', '1,Alice,NY,ok', '1,Alice,Boston,pending', '2,Bob,LA,ok', '3,Cara,,']
Explanation: With `skipUnmatched` set to False, unmatched customer rows remain and processor columns are empty.
Input: ('id', ['id,name', '3,Cara', '1,Alice', '2,Bob'], ['id,city,status', '1,NY,ok', '1,Boston,pending', '2,LA,ok'], True)
Expected Output: ['id,name,city,status', '1,Alice,NY,ok', '1,Alice,Boston,pending', '2,Bob,LA,ok']
Explanation: With `skipUnmatched` set to True, Cara is omitted because she has no processor match.
Input: ('id', ['id,name', '2,Bob'], ['id,city', '1,NY'], True)
Expected Output: ['id,name,city']
Explanation: Edge case: all customer rows are unmatched and `skipUnmatched` is True, so only the header remains.
Input: ('id', ['id,name'], ['id,city', '1,NY'], False)
Expected Output: ['id,name,city']
Explanation: Edge case: there are no customer data rows.
Hints
- This is the same grouping idea as Part 3, but unmatched customer rows now depend on a boolean flag.
- Precompute the list of non-key processor column indexes once so you can reuse it for both matches and empty fills.