Implement Validation and String Compression
Company: Stripe
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
Implement the following two coding tasks.
1. **CSV dataset validation**
You are given:
- a multiline string representing CSV data;
- a set of banned words;
- two column indexes `c1` and `c2`;
- a set of stop words.
Assume commas are field separators, rows do not contain quoted commas, and every non-empty row is expected to have the same number of columns.
Write code that processes the input line by line and supports three parts:
- **Part 1:** validate that no cell is empty after trimming whitespace.
- **Part 2:** reject any row containing a banned word in any column, using case-insensitive matching.
- **Part 3:** for columns `c1` and `c2`, tokenize text using non-letter characters as delimiters and count how many tokens are stop words, case-insensitive, across all valid rows.
Discuss how you handle blank lines, malformed rows, trailing commas, and large inputs.
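One way Parts 2 and 3 might fit together is sketched below in Python. The function name `count_stop_words` is hypothetical, and the sketch makes several assumptions the prompt leaves open: column indexes are 0-based and in range, blank lines are skipped, the first non-blank row fixes the expected width, malformed rows are skipped rather than raising, a banned word must match a whole token (not a substring), and "letters" means ASCII letters. A trailing comma produces an empty final cell, so such a row fails the Part 1 check.

```python
import re

def count_stop_words(csv_text, banned, c1, c2, stop_words):
    """Count stop-word tokens in columns c1 and c2 across valid rows."""
    banned_lower = {w.lower() for w in banned}
    stop_lower = {w.lower() for w in stop_words}
    expected = None  # column count, taken from the first non-blank row
    total = 0
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        cells = [cell.strip() for cell in line.split(",")]
        if expected is None:
            expected = len(cells)
        # Part 1: wrong width or an empty cell makes the row invalid.
        if len(cells) != expected or any(not cell for cell in cells):
            continue
        # Tokenize every cell on non-letter characters (ASCII assumption).
        tokens = [re.findall(r"[A-Za-z]+", cell.lower()) for cell in cells]
        # Part 2: skip rows where any token is a banned word.
        if any(tok in banned_lower for toks in tokens for tok in toks):
            continue
        # Part 3: the set avoids double counting when c1 == c2.
        for idx in {c1, c2}:
            total += sum(tok in stop_lower for tok in tokens[idx])
    return total
```

Each row is handled independently, so for large inputs the same loop could iterate over a file object instead of an in-memory string, keeping memory proportional to the longest line.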
2. **Hierarchical string compression**
You are given a string whose major parts are separated by `/` and whose minor parts inside each major part are separated by `.`. Ignore empty parts created by repeated or trailing separators.
Define `compress(token)` as:
- if `token.length() <= 2`, keep it unchanged;
- otherwise return the string concatenation `first_character + (token.length() - 2) + last_character`, with the count written in decimal.
Implement two parts:
- **Part 1:** compress every token independently and rebuild the string with the same hierarchy.
- **Part 2:** you are also given an integer `m`. If a major part contains more than `m` minor parts, reduce it to exactly `m` groups: keep the first `m - 1` minor parts as separate groups and concatenate all remaining minor parts into one final group. Then compress each group.
Example:
- Input: `abcd/erfgsh/google.com.abc.`
- Part 1 output: `a2d/e4h/g4e.c1m.a1c`
- If `m = 2`, then `google.com.abc` becomes `google` and `comabc`, so that major part compresses to `g4e.c4c`.
Implement both parts and analyze time and space complexity.
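A minimal Python sketch of both parts follows. The name `compress_path` and the convention that `m=None` selects Part 1 behavior are illustrative choices, not part of the prompt. The sketch assumes `m >= 1` when given, and it drops major parts that contain no non-empty minor parts.

```python
def compress(token):
    """Return token unchanged if short, else first char + inner count + last char."""
    if len(token) <= 2:
        return token
    return f"{token[0]}{len(token) - 2}{token[-1]}"

def compress_path(text, m=None):
    """Compress every minor part; if m is given, merge overflow parts first."""
    majors = []
    for major in text.split("/"):
        # Ignore empty parts created by repeated or trailing '.' separators.
        minors = [part for part in major.split(".") if part]
        if not minors:
            continue  # empty major part from repeated or trailing '/'
        if m is not None and len(minors) > m:
            # Keep the first m - 1 parts, concatenate the rest into one group.
            minors = minors[:m - 1] + ["".join(minors[m - 1:])]
        majors.append(".".join(compress(part) for part in minors))
    return "/".join(majors)

print(compress_path("abcd/erfgsh/google.com.abc."))     # a2d/e4h/g4e.c1m.a1c
print(compress_path("abcd/erfgsh/google.com.abc.", 2))  # a2d/e4h/g4e.c4c
```

Every character is visited a constant number of times during splitting, joining, and compression, so the sketch runs in O(n) time and uses O(n) extra space for an input of length n.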
Quick Answer: This question evaluates string processing, input validation, tokenization, hierarchical string manipulation, and complexity analysis. It falls in the Coding & Algorithms category and tests parsing, grouping logic, and compression in a practical programming setting.
Part 1: Validate CSV Rows Have No Empty Trimmed Cells
Given a multiline CSV string, ignore blank lines and use the first non-empty row to define the expected number of columns. Return whether the dataset is valid under these rules: every non-empty row must have the same number of comma-separated fields, and every field must be non-empty after trimming leading and trailing whitespace. Assume fields never contain quoted commas.
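The rules above could be sketched in Python as follows. The function name `is_valid_csv` is hypothetical, and treating an input with no non-empty rows as valid is an assumption, since the statement does not cover that case.

```python
def is_valid_csv(csv_text):
    """Return True if all non-blank rows share one width and have no empty cells."""
    expected = None  # set by the first non-blank row
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored
        cells = line.split(",")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            return False  # malformed row: wrong number of fields
        if any(not cell.strip() for cell in cells):
            return False  # a field is empty after trimming
    return True  # assumption: no non-empty rows counts as valid
```

A trailing comma yields an empty last field after splitting, so a row such as `1,2,` is rejected by the empty-cell check rather than needing a special case.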