PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates string processing, input validation, tokenization, hierarchical string manipulation, and algorithmic complexity analysis skills. It is in the Coding & Algorithms category and tests parsing, data validation, tokenization, grouping logic, and compression within practical programming and algorithm design domains.

  • hard
  • Stripe
  • Coding & Algorithms
  • Software Engineer

Implement Validation and String Compression

Company: Stripe

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

Implement the following two coding tasks. 1. **CSV dataset validation** You are given: - a multiline string representing CSV data; - a set of banned words; - two column indexes `c1` and `c2`; - a set of stop words. Assume commas are field separators, rows do not contain quoted commas, and every non-empty row is expected to have the same number of columns. Write code that processes the input line by line and supports three parts: - **Part 1:** validate that no cell is empty after trimming whitespace. - **Part 2:** reject any row containing a banned word in any column, using case-insensitive matching. - **Part 3:** for columns `c1` and `c2`, tokenize text using non-letter characters as delimiters and count how many tokens are stop words, case-insensitive, across all valid rows. Discuss how you handle blank lines, malformed rows, trailing commas, and large inputs. 2. **Hierarchical string compression** You are given a string whose major parts are separated by `/` and whose minor parts inside each major part are separated by `.`. Ignore empty parts created by repeated or trailing separators. Define `compress(token)` as: - if `token.length() <= 2`, keep it unchanged; - otherwise return `first_character + (token.length() - 2) + last_character`. Implement two parts: - **Part 1:** compress every token independently and rebuild the string with the same hierarchy. - **Part 2:** you are also given an integer `m`. If a major part contains more than `m` minor parts, reduce it to exactly `m` groups by keeping the first `m - 1` minor parts as separate groups and concatenating all remaining minor parts into the final group before compressing. Example: - Input: `abcd/erfgsh/google.com.abc.` - Part 1 output: `a2d/e4h/g4e.c1m.a1c` - If `m = 2`, then `google.com.abc` becomes `google` and `comabc`, so that major part compresses to `g4e.c4c`. Implement both parts and analyze time and space complexity.

Quick Answer: This question evaluates string processing, input validation, tokenization, hierarchical string manipulation, and algorithmic complexity analysis skills. It is in the Coding & Algorithms category and tests parsing, data validation, tokenization, grouping logic, and compression within practical programming and algorithm design domains.

Part 1: Validate CSV Rows Have No Empty Trimmed Cells

Given a multiline CSV string, ignore blank lines and use the first non-empty row to define the expected number of columns. Return whether the dataset is valid under these rules: every non-empty row must have the same number of comma-separated fields, and every field must be non-empty after trimming leading and trailing whitespace. Assume fields never contain quoted commas.

Constraints

  • 0 <= len(csv_data) <= 10^6
  • Rows do not contain quoted commas
  • Blank lines, including lines containing only whitespace, must be ignored
  • A trailing comma creates an empty final cell and therefore makes the dataset invalid

Examples

Input: ('name,age\nalice,30\nbob,25',)

Expected Output: True

Explanation: All rows have 2 columns, and no trimmed cell is empty.

Input: ('name,age\nalice, \nbob,25',)

Expected Output: False

Explanation: The second row contains a cell that becomes empty after trimming spaces.

Input: ('a,b,c\n1,2\n3,4,5',)

Expected Output: False

Explanation: The second non-empty row has fewer columns than the first one.

Input: (' \n\n',)

Expected Output: True

Explanation: There are no non-empty rows, so nothing violates the rules.

Hints

  1. Use the first non-empty row to lock in the expected column count.
  2. After splitting by commas, trim each field before checking whether it is empty.

Part 2: Count CSV Rows Accepted After Banned-Word Filtering

Given a multiline CSV string and a collection of banned words, count how many rows are accepted. Ignore blank lines. The first non-empty row defines the expected number of columns. A row is accepted only if it has that column count, every trimmed cell is non-empty, and no trimmed cell exactly matches any banned word ignoring case. Rows that fail any rule are rejected and not counted.

Constraints

  • 0 <= len(csv_data) <= 10^6
  • 0 <= number of banned words <= 10^5
  • Rows do not contain quoted commas
  • The first non-empty row establishes the expected column count for all later non-empty rows

Examples

Input: ('a,b\nbad,x\nc,d', ['bad'])

Expected Output: 2

Explanation: The middle row is rejected because one cell matches a banned word.

Input: ('ok,no\nBaD,yes\nfine,BAD', ['bad'])

Expected Output: 1

Explanation: Banned-word matching is case-insensitive, so only the first row is accepted.

Input: ('a,b\n1,\n2,3,4\n4,5', [])

Expected Output: 2

Explanation: The second row has an empty trimmed cell, the third row has the wrong column count, and the first and fourth rows are accepted.

Input: (' \n\n', ['x'])

Expected Output: 0

Explanation: Blank lines are ignored, so there are no accepted rows.

Hints

  1. Normalize banned words to lowercase once before scanning rows.
  2. Treat malformed rows and rows with empty trimmed cells as rejected instead of raising errors.

Part 3: Count Stop-Word Tokens in Two CSV Columns

Given a multiline CSV string, a collection of banned words, two 0-based column indexes c1 and c2, and a collection of stop words, count how many stop-word tokens appear across all accepted rows. A row is accepted only if it passes the same checks as in Part 2: ignore blank lines, use the first non-empty row to define the expected number of columns, reject rows with a different column count, reject rows with any empty trimmed cell, and reject rows where any trimmed cell exactly matches a banned word ignoring case. For accepted rows, inspect columns c1 and c2, tokenize each chosen cell by splitting on non-letter characters, and count how many tokens are stop words ignoring case. If c1 == c2, count that column only once. Ignore target indexes that are outside the row.

Constraints

  • 0 <= len(csv_data) <= 10^6
  • 0 <= c1, c2 <= 10^9
  • Rows do not contain quoted commas
  • A token is any maximal substring matching [A-Za-z]+

Examples

Input: ('Alpha and beta,The dog,ok\nhello,or but,bad\nA-cat,or;the,ok', ['bad'], 0, 1, ['the', 'and', 'or'])

Expected Output: 4

Explanation: The second row is rejected by the banned-word rule. The remaining rows contribute tokens and, the, or, and the.

Input: ('a and,b and\nor,the', [], 1, 1, ['and', 'the'])

Expected Output: 2

Explanation: Because c1 and c2 are the same, column 1 is counted once per row.

Input: ('x,y\n1,\n2,3,4\nand,the', [], 0, 1, ['and', 'the'])

Expected Output: 2

Explanation: The second row has an empty cell, the third row has the wrong number of columns, and only the last row contributes stop-word tokens.

Input: (' \n\n', ['bad'], 0, 1, ['a'])

Expected Output: 0

Explanation: There are no accepted rows.

Hints

  1. A regex like [A-Za-z]+ can extract exactly the tokens you need.
  2. Use a set for the target columns so that c1 == c2 does not double count the same cell.

Part 4: Compress a Hierarchical String Token by Token

You are given a string whose major parts are separated by '/' and whose minor parts inside each major part are separated by '.'. Ignore empty parts created by repeated separators or trailing separators. Define compress(token) as keeping the token unchanged when its length is at most 2, otherwise replacing it with first character + (length - 2) + last character. Compress every token independently and rebuild the string with the same hierarchy.

Constraints

  • 0 <= len(s) <= 10^6
  • Only '/' separates major parts and '.' separates minor parts
  • Empty parts caused by repeated or trailing separators must be ignored
  • Tokens may contain any characters other than '/' and '.'

Examples

Input: ('abcd/erfgsh/google.com.abc.',)

Expected Output: 'a2d/e4h/g4e.c1m.a1c'

Explanation: Each token is compressed independently while preserving '/' and '.' structure.

Input: ('//ab..cde./f/',)

Expected Output: 'ab.c1e/f'

Explanation: Empty major and minor parts are ignored.

Input: ('',)

Expected Output: ''

Explanation: An empty input contains no tokens, so the result is an empty string.

Input: ('hi',)

Expected Output: 'hi'

Explanation: Tokens of length 2 or less stay unchanged.

Hints

  1. Split first by '/', then split each major part by '.'.
  2. Filter out empty strings before calling compress on each token.

Part 5: Compress a Hierarchical String After Merging Minor Parts

You are given the same hierarchical string as in Part 4 and an integer m. Ignore empty parts created by repeated separators or trailing separators. For each major part, first collect its non-empty minor parts. If the major part has more than m minor parts, reduce it to exactly m groups by keeping the first m - 1 minor parts as separate groups and concatenating all remaining minor parts into the final group. Then apply the same token compression rule as in Part 4 to each resulting group and rebuild the full string.

Constraints

  • 0 <= len(s) <= 10^6
  • 1 <= m <= 10^6
  • Empty parts caused by repeated or trailing separators must be ignored
  • If a major part has at most m minor parts, no merging is performed for that major part

Examples

Input: ('abcd/erfgsh/google.com.abc.', 2)

Expected Output: 'a2d/e4h/g4e.c4c'

Explanation: The third major part has 3 minor parts, so it becomes google and comabc before compression.

Input: ('alpha.beta/gamma', 1)

Expected Output: 'a7a/g3a'

Explanation: With m = 1, alpha and beta merge into alphabeta, which compresses to a7a.

Input: ('//ab..cde./f.g', 3)

Expected Output: 'ab.c1e/f.g'

Explanation: No major part has more than 3 minor parts, so this behaves like normal token-by-token compression.

Input: ('one.two.three.four', 2)

Expected Output: 'o1e.t10r'

Explanation: The first group stays one, and the remaining parts merge into twothreefour before compression.

Input: ('///..', 2)

Expected Output: ''

Explanation: All parts are empty and therefore ignored.

Hints

  1. Handle m = 1 carefully: all minor parts of a major part collapse into one concatenated group.
  2. Do the grouping first, then compress each resulting group.
Last updated: Apr 30, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Assign Reviewers from Changed Files - Stripe (medium)
  • Generate Account Email Notifications - Stripe (medium)
  • Calculate Transaction Fees - Stripe (medium)
  • Build an Account Transfer Ledger - Stripe (medium)
  • Compute transaction fees from a CSV string - Stripe (hard)