PracHub

Quick Overview

This question evaluates a candidate's ability to implement CSV-style dataset joins, exercising skills in string parsing, sorting, join semantics (left join behavior and multiple matches), and handling edge cases in data processing.


Implement a CSV dataset join

Company: Stripe

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Take-home Project

Implement `joinDataSet(fieldName, customerFile, processorFile, skipUnmatched)`.

You are given two datasets represented as `List<String>`, where:

  • `customerFile[0]` is the header row for the customer dataset.
  • `processorFile[0]` is the header row for the processor dataset.
  • Every remaining element is a data row.
  • Columns are separated by commas.
  • You may assume values do not contain escaped commas or quoted commas.

Inputs:

  • `fieldName: String`, the column name used as the join key.
  • `customerFile: List<String>`
  • `processorFile: List<String>`
  • `skipUnmatched: boolean`

Return a new `List<String>` representing the joined dataset:

  • The first element is the output header.
  • The remaining elements are the output rows.

Requirements:

  1. The join key `fieldName` exists in both headers.
  2. Sort the customer rows and processor rows by `fieldName` before joining.
  3. Perform a join from `customerFile` to `processorFile` using `fieldName`.
  4. If a customer row has multiple matching processor rows, output multiple joined rows, one for each match.
  5. If a customer row has no matching processor row:
       • when `skipUnmatched == false`, include the customer row and fill the processor-side columns with empty strings;
       • when `skipUnmatched == true`, omit that customer row entirely.
  6. Processor rows with no matching customer row should not appear in the output.
  7. The output should contain all customer columns followed by all processor columns from the processor file. If the join key would appear twice, include it only once in the output.

In short, this problem evolves through four stages:

  • Part 1: basic join on `fieldName`
  • Part 2: left join behavior for unmatched customer rows
  • Part 3: support multiple matches on the processor side
  • Part 4: optionally skip unmatched customer rows when `skipUnmatched` is `true`

Write the function that produces the joined CSV-style dataset.


Part 1: Basic CSV inner join on a field

Implement `solution(fieldName, customerFile, processorFile)` for a basic join between two CSV-style datasets stored as lists of strings. The first string in each list is the header row, and every other string is a data row. Columns are separated by commas. Find the column named `fieldName` in both headers, sort both datasets by that field, and join customer rows to processor rows using that key. For this part, assume each processor key appears at most once. Only include customer rows that have a matching processor row. The output header must contain all customer columns followed by all processor columns except the duplicate join key.
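The sort step might be sketched as follows. The helper name `sort_rows` is hypothetical, and the sketch assumes keys are compared as plain strings, since the problem does not say whether keys are numeric. Under that assumption `'10'` sorts before `'2'`.

```python
def sort_rows(rows, key_index):
    """Return data rows sorted by the value in column key_index.

    Assumption: keys are compared as strings. sorted() is stable, so rows
    with equal keys keep their original relative order.
    """
    return sorted(rows, key=lambda row: row.split(",")[key_index])


customer_rows = ["2,Bob", "1,Alice"]
print(sort_rows(customer_rows, 0))  # ['1,Alice', '2,Bob']
```

Because the sort is stable, rows that share a key stay in input order, which matters once processor keys are allowed to repeat.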

Constraints

  • `customerFile` and `processorFile` each contain at least one row, which is the header.
  • `fieldName` exists in both headers.
  • All rows in a file have the same number of comma-separated columns as that file's header.
  • Values do not contain quoted commas or escaped commas.
  • For this part, each processor join key appears at most once.

Examples

Input: ('id', ['id,name', '2,Bob', '1,Alice'], ['id,city', '1,New York', '2,Los Angeles'])

Expected Output: ['id,name,city', '1,Alice,New York', '2,Bob,Los Angeles']

Explanation: Both files are sorted by `id` before joining, so Alice appears before Bob in the output.

Input: ('id', ['id,name', '3,Cara', '1,Alice', '2,Bob'], ['id,city', '2,LA', '1,NY'])

Expected Output: ['id,name,city', '1,Alice,NY', '2,Bob,LA']

Explanation: Cara has no processor match, so she is omitted in this inner-join version.

Input: ('id', ['id,name'], ['id,city', '1,NY'])

Expected Output: ['id,name,city']

Explanation: Edge case: there are no customer data rows, so only the output header is returned.

Input: ('email', ['name,email', 'Bob,b@example.com', 'Ann,a@example.com'], ['email,score', 'a@example.com,90'])

Expected Output: ['name,email,score', 'Ann,a@example.com,90']

Explanation: The join key does not have to be the first column.
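Since the key can sit in any column, one way to locate it and build the output header is sketched below. `build_header` is a hypothetical helper, not part of the required interface.

```python
def build_header(field_name, customer_header, processor_header):
    """Return (output_header, customer_key_index, processor_key_index)."""
    c_cols = customer_header.split(",")
    p_cols = processor_header.split(",")
    c_key = c_cols.index(field_name)
    p_key = p_cols.index(field_name)
    # Keep every customer column, then every processor column except the key.
    out_cols = c_cols + [col for i, col in enumerate(p_cols) if i != p_key]
    return ",".join(out_cols), c_key, p_key


print(build_header("email", "name,email", "email,score"))
# ('name,email,score', 1, 0)
```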

Hints

  1. Parse the headers first and store the index of the join field in each file.
  2. After sorting the data rows, a dictionary from processor key to processor row makes matching efficient.
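Putting the two hints together, a minimal sketch of the Part 1 inner join could look like this. The name `inner_join` is hypothetical, and keys are assumed to compare as strings.

```python
def inner_join(field_name, customer_file, processor_file):
    c_header = customer_file[0].split(",")
    p_header = processor_file[0].split(",")
    c_key = c_header.index(field_name)
    p_key = p_header.index(field_name)

    # Processor columns to copy into the output (everything except the key).
    p_keep = [i for i in range(len(p_header)) if i != p_key]

    customers = sorted((r.split(",") for r in customer_file[1:]),
                       key=lambda r: r[c_key])
    processors = sorted((r.split(",") for r in processor_file[1:]),
                        key=lambda r: r[p_key])

    # Part 1 assumes each processor key appears at most once.
    by_key = {row[p_key]: row for row in processors}

    result = [",".join(c_header + [p_header[i] for i in p_keep])]
    for row in customers:
        match = by_key.get(row[c_key])
        if match is not None:  # inner join: drop unmatched customers
            result.append(",".join(row + [match[i] for i in p_keep]))
    return result


print(inner_join("id", ["id,name", "2,Bob", "1,Alice"],
                 ["id,city", "1,New York", "2,Los Angeles"]))
# ['id,name,city', '1,Alice,New York', '2,Bob,Los Angeles']
```

The output order comes from the sorted customer rows; the dictionary only serves lookups.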

Part 2: Left join with empty processor columns

Implement `solution(fieldName, customerFile, processorFile)` for a CSV-style left join. The first row in each file is the header. Sort the customer and processor data rows by `fieldName`, then join each customer row to the processor row with the same key. For this part, assume each processor key appears at most once. If a customer row has no processor match, keep the customer row and fill all processor-side columns with empty strings. Processor-only rows must never appear. The output header must contain all customer columns followed by all processor columns except the duplicate join key.

Constraints

  • `customerFile` and `processorFile` each contain at least a header row.
  • `fieldName` exists in both headers.
  • All rows in a file have the same number of columns as that file's header.
  • Values do not contain quoted commas or escaped commas.
  • For this part, each processor join key appears at most once.

Examples

Input: ('id', ['id,name', '2,Bob', '1,Alice', '3,Cara'], ['id,city', '1,NY', '2,LA'])

Expected Output: ['id,name,city', '1,Alice,NY', '2,Bob,LA', '3,Cara,']

Explanation: Cara has no match, so her processor-side city value is empty.

Input: ('id', ['id,name', '2,Bob'], ['id,city'])

Expected Output: ['id,name,city', '2,Bob,']

Explanation: Edge case: the processor file has only a header, so every customer row gets empty processor columns.

Input: ('id', ['id,name'], ['id,city', '1,NY'])

Expected Output: ['id,name,city']

Explanation: Edge case: there are no customer data rows.

Input: ('email', ['name,email', 'Bob,b@x', 'Ann,a@x'], ['email,city,zip', 'a@x,Boston,02101'])

Expected Output: ['name,email,city,zip', 'Ann,a@x,Boston,02101', 'Bob,b@x,,']

Explanation: When there are two processor-side columns and no match, both missing values become empty strings.

Hints

  1. Build the output header by appending only the non-key processor columns.
  2. When there is no processor match, append the correct number of empty strings before joining the row back into CSV format.
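A possible sketch of the Part 2 left join follows; `left_join` is a hypothetical name and keys are assumed to compare as strings. Because processor keys are unique in this part, the order of processor rows cannot affect the output, so the sketch sorts only the customer rows.

```python
def left_join(field_name, customer_file, processor_file):
    c_header = customer_file[0].split(",")
    p_header = processor_file[0].split(",")
    c_key, p_key = c_header.index(field_name), p_header.index(field_name)
    p_keep = [i for i in range(len(p_header)) if i != p_key]

    customers = sorted((r.split(",") for r in customer_file[1:]),
                       key=lambda r: r[c_key])
    by_key = {}
    for raw in processor_file[1:]:
        fields = raw.split(",")
        by_key[fields[p_key]] = fields

    # One empty string per non-key processor column.
    empty_fill = [""] * len(p_keep)

    result = [",".join(c_header + [p_header[i] for i in p_keep])]
    for row in customers:
        match = by_key.get(row[c_key])
        tail = [match[i] for i in p_keep] if match is not None else empty_fill
        result.append(",".join(row + tail))
    return result


print(left_join("id", ["id,name", "2,Bob"], ["id,city"]))
# ['id,name,city', '2,Bob,']
```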

Part 3: Left join with multiple processor matches

Implement `solution(fieldName, customerFile, processorFile)` for a CSV-style join where the processor file may contain multiple rows with the same join key. Sort both datasets by `fieldName` before joining. For each customer row, output one joined row for every matching processor row. If a customer row has no match, still include it and fill the processor-side columns with empty strings. Processor rows that do not match any customer row must not appear. The output header contains all customer columns followed by all processor columns except the duplicate join key.

Constraints

  • `customerFile` and `processorFile` each contain at least a header row.
  • `fieldName` exists in both headers.
  • All rows in a file have the same number of columns as that file's header.
  • Values do not contain quoted commas or escaped commas.
  • Processor keys may repeat, and every matching processor row should generate an output row.

Examples

Input: ('id', ['id,name', '2,Bob', '1,Alice'], ['id,city', '1,SF', '1,NY', '2,LA'])

Expected Output: ['id,name,city', '1,Alice,SF', '1,Alice,NY', '2,Bob,LA']

Explanation: Alice matches two processor rows, so she produces two joined output rows.

Input: ('id', ['id,name', '3,Cara', '2,Bob'], ['id,city', '2,LA', '2,SD'])

Expected Output: ['id,name,city', '2,Bob,LA', '2,Bob,SD', '3,Cara,']

Explanation: Bob has two matches, while Cara is unmatched and is kept with an empty processor value.

Input: ('id', ['id,name', '1,Alice', '1,Ava'], ['id,city', '1,NY', '1,SF'])

Expected Output: ['id,name,city', '1,Alice,NY', '1,Alice,SF', '1,Ava,NY', '1,Ava,SF']

Explanation: Each customer row with the same key joins against every processor row with that key.

Input: ('id', ['id,name', '1,Alice'], ['id,city'])

Expected Output: ['id,name,city', '1,Alice,']

Explanation: Edge case: no processor data rows exist, so the customer row remains with empty processor columns.

Hints

  1. Instead of mapping each processor key to one row, map it to a list of rows.
  2. Handle unmatched customer rows separately by appending empty strings for all non-key processor columns.
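The grouping idea from the first hint might look like the sketch below. `group_by_key` is a hypothetical helper, and keys are assumed to compare as strings.

```python
from collections import defaultdict


def group_by_key(rows, key_index):
    """Map each join key to the list of split rows that carry it."""
    groups = defaultdict(list)
    # sorted() is stable, so rows sharing a key keep their input order.
    for row in sorted(rows, key=lambda r: r.split(",")[key_index]):
        fields = row.split(",")
        groups[fields[key_index]].append(fields)
    return groups


groups = group_by_key(["1,SF", "1,NY", "2,LA"], 0)
print(groups.get("1"))  # [['1', 'SF'], ['1', 'NY']]
```

Looking keys up with `groups.get(key)` rather than `groups[key]` avoids inserting empty lists for unmatched customer keys.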

Part 4: Full dataset join with optional skipping of unmatched rows

Implement `solution(fieldName, customerFile, processorFile, skipUnmatched)` for the full CSV dataset join. The first row of each file is the header row. Sort both datasets by `fieldName`, then join customer rows to processor rows using that key. A customer row may match multiple processor rows, and each match must produce a separate output row. If a customer row has no match, then include it with empty processor-side columns when `skipUnmatched` is `False`, or omit it entirely when `skipUnmatched` is `True`. Processor-only rows must never appear. The output header contains all customer columns followed by all processor columns except the duplicate join key.

Constraints

  • `customerFile` and `processorFile` each contain at least a header row.
  • `fieldName` exists in both headers.
  • All rows in a file have the same number of columns as that file's header.
  • Values do not contain quoted commas or escaped commas.
  • Processor keys may repeat, and every matching processor row must be included.

Examples

Input: ('id', ['id,name', '3,Cara', '1,Alice', '2,Bob'], ['id,city,status', '1,NY,ok', '1,Boston,pending', '2,LA,ok'], False)

Expected Output: ['id,name,city,status', '1,Alice,NY,ok', '1,Alice,Boston,pending', '2,Bob,LA,ok', '3,Cara,,']

Explanation: With `skipUnmatched` set to False, unmatched customer rows remain and processor columns are empty.

Input: ('id', ['id,name', '3,Cara', '1,Alice', '2,Bob'], ['id,city,status', '1,NY,ok', '1,Boston,pending', '2,LA,ok'], True)

Expected Output: ['id,name,city,status', '1,Alice,NY,ok', '1,Alice,Boston,pending', '2,Bob,LA,ok']

Explanation: With `skipUnmatched` set to True, Cara is omitted because she has no processor match.

Input: ('id', ['id,name', '2,Bob'], ['id,city', '1,NY'], True)

Expected Output: ['id,name,city']

Explanation: Edge case: all customer rows are unmatched and `skipUnmatched` is True, so only the header remains.

Input: ('id', ['id,name'], ['id,city', '1,NY'], False)

Expected Output: ['id,name,city']

Explanation: Edge case: there are no customer data rows.

Hints

  1. This is the same grouping idea as Part 3, but unmatched customer rows now depend on a boolean flag.
  2. Precompute the list of non-key processor column indexes once so you can reuse it for both matches and empty fills.
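Combining the pieces, one possible sketch of the full join is shown below. The snake_case names (`join_data_set`, `skip_unmatched`) are Python renderings chosen for illustration, and keys are assumed to compare as strings.

```python
from collections import defaultdict


def join_data_set(field_name, customer_file, processor_file, skip_unmatched):
    c_header = customer_file[0].split(",")
    p_header = processor_file[0].split(",")
    c_key, p_key = c_header.index(field_name), p_header.index(field_name)
    # Non-key processor column indexes, computed once and reused below.
    p_keep = [i for i in range(len(p_header)) if i != p_key]

    customers = sorted((r.split(",") for r in customer_file[1:]),
                       key=lambda r: r[c_key])
    processors = sorted((r.split(",") for r in processor_file[1:]),
                        key=lambda r: r[p_key])

    # Group processor rows by key so repeated keys yield multiple matches.
    matches = defaultdict(list)
    for row in processors:
        matches[row[p_key]].append(row)

    result = [",".join(c_header + [p_header[i] for i in p_keep])]
    for row in customers:
        found = matches.get(row[c_key], [])
        if found:
            for p_row in found:
                result.append(",".join(row + [p_row[i] for i in p_keep]))
        elif not skip_unmatched:
            # Unmatched customer row: pad processor columns with empties.
            result.append(",".join(row + [""] * len(p_keep)))
    return result


print(join_data_set("id", ["id,name", "2,Bob"], ["id,city", "1,NY"], True))
# ['id,name,city']
```

Processor-only rows never reach the output because the loop iterates over customer rows only.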
Last updated: May 16, 2026


