Implement Validation and String Compression
Company: Stripe
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
Quick Answer: This question evaluates string processing, input validation, tokenization, hierarchical string manipulation, and algorithmic complexity analysis skills. It is in the Coding & Algorithms category and tests parsing, data validation, tokenization, grouping logic, and compression within practical programming and algorithm design domains.
Part 1: Validate CSV Rows Have No Empty Trimmed Cells
Constraints
- 0 <= len(csv_data) <= 10^6
- Rows do not contain quoted commas
- Blank lines, including lines containing only whitespace, must be ignored
- A trailing comma creates an empty final cell and therefore makes the dataset invalid
Examples
Input: ('name,age\nalice,30\nbob,25',)
Expected Output: True
Explanation: All rows have 2 columns, and no trimmed cell is empty.
Input: ('name,age\nalice, \nbob,25',)
Expected Output: False
Explanation: The second row contains a cell that becomes empty after trimming spaces.
Input: ('a,b,c\n1,2\n3,4,5',)
Expected Output: False
Explanation: The second non-empty row has fewer columns than the first one.
Input: (' \n\n',)
Expected Output: True
Explanation: There are no non-empty rows, so nothing violates the rules.
Hints
- Use the first non-empty row to lock in the expected column count.
- After splitting by commas, trim each field before checking whether it is empty.
Part 2: Count CSV Rows Accepted After Banned-Word Filtering
Constraints
- 0 <= len(csv_data) <= 10^6
- 0 <= number of banned words <= 10^5
- Rows do not contain quoted commas
- The first non-empty row establishes the expected column count for all later non-empty rows
Examples
Input: ('a,b\nbad,x\nc,d', ['bad'])
Expected Output: 2
Explanation: The middle row is rejected because one cell matches a banned word.
Input: ('ok,no\nBaD,yes\nfine,BAD', ['bad'])
Expected Output: 1
Explanation: Banned-word matching is case-insensitive, so only the first row is accepted.
Input: ('a,b\n1,\n2,3,4\n4,5', [])
Expected Output: 2
Explanation: The second row has an empty trimmed cell, the third row has the wrong column count, and the first and fourth rows are accepted.
Input: (' \n\n', ['x'])
Expected Output: 0
Explanation: Blank lines are ignored, so there are no accepted rows.
Hints
- Normalize banned words to lowercase once before scanning rows.
- Treat malformed rows and rows with empty trimmed cells as rejected instead of raising errors.
Part 3: Count Stop-Word Tokens in Two CSV Columns
Constraints
- 0 <= len(csv_data) <= 10^6
- 0 <= c1, c2 <= 10^9
- Rows do not contain quoted commas
- A token is any maximal substring matching [A-Za-z]+
Examples
Input: ('Alpha and beta,The dog,ok\nhello,or but,bad\nA-cat,or;the,ok', ['bad'], 0, 1, ['the', 'and', 'or'])
Expected Output: 4
Explanation: The second row is rejected by the banned-word rule. The remaining rows contribute tokens and, the, or, and the.
Input: ('a and,b and\nor,the', [], 1, 1, ['and', 'the'])
Expected Output: 2
Explanation: Because c1 and c2 are the same, column 1 is counted once per row.
Input: ('x,y\n1,\n2,3,4\nand,the', [], 0, 1, ['and', 'the'])
Expected Output: 2
Explanation: The second row has an empty cell, the third row has the wrong number of columns, and only the last row contributes stop-word tokens.
Input: (' \n\n', ['bad'], 0, 1, ['a'])
Expected Output: 0
Explanation: There are no accepted rows.
Hints
- A regex like [A-Za-z]+ can extract exactly the tokens you need.
- Use a set for the target columns so that c1 == c2 does not double count the same cell.
Part 4: Compress a Hierarchical String Token by Token
Constraints
- 0 <= len(s) <= 10^6
- Only '/' separates major parts and '.' separates minor parts
- Empty parts caused by repeated or trailing separators must be ignored
- Tokens may contain any characters other than '/' and '.'
Examples
Input: ('abcd/erfgsh/google.com.abc.',)
Expected Output: 'a2d/e4h/g4e.c1m.a1c'
Explanation: Each token is compressed independently while preserving '/' and '.' structure.
Input: ('//ab..cde./f/',)
Expected Output: 'ab.c1e/f'
Explanation: Empty major and minor parts are ignored.
Input: ('',)
Expected Output: ''
Explanation: An empty input contains no tokens, so the result is an empty string.
Input: ('hi',)
Expected Output: 'hi'
Explanation: Tokens of length 2 or less stay unchanged.
Hints
- Split first by '/', then split each major part by '.'.
- Filter out empty strings before calling compress on each token.
Part 5: Compress a Hierarchical String After Merging Minor Parts
Constraints
- 0 <= len(s) <= 10^6
- 1 <= m <= 10^6
- Empty parts caused by repeated or trailing separators must be ignored
- If a major part has at most m minor parts, no merging is performed for that major part
Examples
Input: ('abcd/erfgsh/google.com.abc.', 2)
Expected Output: 'a2d/e4h/g4e.c4c'
Explanation: The third major part has 3 minor parts, so it becomes google and comabc before compression.
Input: ('alpha.beta/gamma', 1)
Expected Output: 'a7a/g3a'
Explanation: With m = 1, alpha and beta merge into alphabeta, which compresses to a7a.
Input: ('//ab..cde./f.g', 3)
Expected Output: 'ab.c1e/f.g'
Explanation: No major part has more than 3 minor parts, so this behaves like normal token-by-token compression.
Input: ('one.two.three.four', 2)
Expected Output: 'o1e.t10r'
Explanation: The first group stays one, and the remaining parts merge into twothreefour before compression.
Input: ('///..', 2)
Expected Output: ''
Explanation: All parts are empty and therefore ignored.
Hints
- Handle m = 1 carefully: all minor parts of a major part collapse into one concatenated group.
- Do the grouping first, then compress each resulting group.