Validate KYC CSV Records
Company: Stripe
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates skills in CSV parsing, string normalization, field-level validation, and rule-based text matching, emphasizing attention to edge cases such as trimming whitespace, substring checks, and word-overlap logic.
Constraints
- 1 <= total length of `csv_data` <= 10^6
- The first line is the header `col1,col2,col3,col4,col5,col6`
- Fields do not contain embedded commas or quoted newlines
- For overlap checking, split words only by whitespace
- `W2`, the list of words in `col2` after removing `llc` and `inc`, can be assumed non-empty
Examples
Input: ("col1,col2,col3,col4,col5,col6",)
Expected Output: []
Explanation: There are no data rows, so the result is an empty list.
Input: ("col1,col2,col3,col4,col5,col6\n1,Blue Ocean LLC,x,Ocean Blue,Blue LLC,z\n2,North GROUP Holdings,x,North Holdings,North Holdings,z\n3,Sunrise Inc Bakery,x,Random Name, Sunrise Bakery ,z\n4,Red Apple Bakery Cafe,x,Apple Cafe,ValidFive,z\n5,Green Field Market,x,Green Shop,Field,z",)
Expected Output: ["VERIFIED", "NOT VERIFIED", "VERIFIED", "VERIFIED", "NOT VERIFIED"]
Explanation: Row 1 matches col4 fully once 'llc' is ignored. Row 2 contains the forbidden term 'group'. Row 3 matches col5 fully after trimming and ignoring 'inc'. Row 4 has exactly 50% overlap with col4 (2 of 4 words). Row 5 has only 1 of 3 words overlapping with col4 and 1 of 3 with col5, so neither field reaches 50% on its own.
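The overlap arithmetic in rows 4 and 5 can be sketched as below. This is a minimal sketch, not a reference solution: the helper names `name_words` and `meets_overlap` are hypothetical, and lowercasing before comparison is an assumption suggested by the examples rather than stated in the constraints.

```python
# Words dropped from col2 before counting (from the constraints).
IGNORED_WORDS = {"llc", "inc"}


def name_words(name: str) -> list[str]:
    """Return W2: the whitespace-separated words of col2, lowercased
    (assumption), with 'llc' and 'inc' removed."""
    return [w for w in name.lower().split() if w not in IGNORED_WORDS]


def meets_overlap(col2: str, candidate: str) -> bool:
    """True if at least half of W2 appears among the candidate's words."""
    w2 = name_words(col2)
    candidate_words = set(candidate.lower().split())
    matches = sum(1 for w in w2 if w in candidate_words)
    # Integer form of matches / len(W2) >= 0.5; W2 is assumed non-empty.
    return 2 * matches >= len(w2)


# Row 4: 2 of 4 words match col4, exactly 50%.
print(meets_overlap("Red Apple Bakery Cafe", "Apple Cafe"))
# Row 5: 1 of 3 words match col4, and 1 of 3 match col5.
print(meets_overlap("Green Field Market", "Green Shop"))
print(meets_overlap("Green Field Market", "Field"))
```

Each candidate field is tested on its own. Pooling the words of col4 and col5 would give row 5 two matches out of three, which contradicts the expected output.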
Input: ("col1,col2,col3,col4,col5,col6\n1,Alpha Beta,x,Alpha Beta,abcde,z\n2,Alpha Beta,x,Alpha Beta, ,z\n3,Alpha Beta,x,Alpha,abcd,z",)
Expected Output: ["VERIFIED", "NOT VERIFIED", "NOT VERIFIED"]
Explanation: Row 1 is valid and uses the lower bound length 5 for col5. Row 2 fails because col5 is empty after trimming. Row 3 fails because col5 has length 4.
Input: ("col1,col2,col3,col4,col5,col6\n1,North Star LLC,x,North Star,1234567890123456789012345678901,z\n2,Fresh Market,x,Fresh,12345",)
Expected Output: ["VERIFIED", "NOT VERIFIED"]
Explanation: Row 1 is valid and uses the upper bound length 31 for col5. Row 2 has only 5 fields instead of 6, so it is not verified.
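The structural checks exercised by the last two examples (field count and the col5 length bounds) can be sketched as follows. The function name `has_valid_shape` is hypothetical. The inclusive bounds 5 and 31 are taken from the explanations above, measured after trimming.

```python
EXPECTED_FIELDS = 6
MIN_COL5_LEN, MAX_COL5_LEN = 5, 31  # inclusive bounds, per the examples


def has_valid_shape(line: str) -> bool:
    """Check the field count and the trimmed length of col5 for one row."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != EXPECTED_FIELDS:
        return False
    col5 = fields[4]
    return MIN_COL5_LEN <= len(col5) <= MAX_COL5_LEN


print(has_valid_shape("1,Alpha Beta,x,Alpha Beta,abcde,z"))  # length 5
print(has_valid_shape("2,Alpha Beta,x,Alpha Beta, ,z"))      # empty after trimming
print(has_valid_shape("2,Fresh Market,x,Fresh,12345"))       # only 5 fields
```

A col5 that is empty after trimming fails the lower bound automatically, so it needs no separate branch in this sketch.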
Hints
- Process one row at a time: split by commas, trim each field, and fail fast as soon as any rule is broken.
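- For the 50% overlap rule, keep `col2` as a list of words (`W2`), but convert the words from `col4` and `col5` to two separate sets for fast membership checks. Count `matches`, the words of `W2` found in one set, and use `2 * matches >= len(W2)` to test the threshold without floating-point math. Test each field on its own; a row passes if either `col4` or `col5` reaches the threshold.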
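Putting the two hints together, one possible fail-fast validator is sketched below. It is an illustration under several assumptions, not the expected answer: the names `validate_kyc` and `verify_row` are hypothetical, the forbidden-term list holds only 'group' because that is the only term the examples confirm, the term is checked as a case-insensitive substring of col2 only, and blank lines are skipped.

```python
FORBIDDEN_TERMS = ("group",)    # assumption: only 'group' appears in the examples
IGNORED_WORDS = {"llc", "inc"}  # dropped from col2 before counting overlap
MIN_COL5_LEN, MAX_COL5_LEN = 5, 31


def verify_row(line: str) -> str:
    """Apply the rules to one data row, returning as soon as one fails."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 6:
        return "NOT VERIFIED"
    col2, col4, col5 = fields[1], fields[3], fields[4]

    if not MIN_COL5_LEN <= len(col5) <= MAX_COL5_LEN:
        return "NOT VERIFIED"

    lowered = col2.lower()
    # Assumption: forbidden terms are matched as substrings of col2.
    if any(term in lowered for term in FORBIDDEN_TERMS):
        return "NOT VERIFIED"

    w2 = [w for w in lowered.split() if w not in IGNORED_WORDS]
    for candidate in (col4, col5):
        candidate_words = set(candidate.lower().split())
        matches = sum(1 for w in w2 if w in candidate_words)
        if w2 and 2 * matches >= len(w2):
            return "VERIFIED"
    return "NOT VERIFIED"


def validate_kyc(csv_data: str) -> list[str]:
    """Return one verdict per data row; the first line is the header."""
    rows = csv_data.splitlines()[1:]
    return [verify_row(row) for row in rows if row.strip()]


sample = (
    "col1,col2,col3,col4,col5,col6\n"
    "1,North Star LLC,x,North Star,1234567890123456789012345678901,z\n"
    "2,Fresh Market,x,Fresh,12345"
)
print(validate_kyc(sample))
```

Each row is split once, so the work is linear in the length of `csv_data`. The only extra memory per row is the two word sets.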