PracHub
QuestionsCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates proficiency in robust file parsing, schema-driven data modeling, type conversion, validation, duplicate handling, and implementing aggregate and ranking queries using Python's standard library for messy, delimited payroll data.

  • medium
  • Squarepoint
  • Coding & Algorithms
  • Software Engineer

Parse payroll file and answer queries

Company: Squarepoint

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Take-home Project

You are given a payroll "file" as a `StringIO` (file-like) object in Python. The underlying text represents manually-entered payroll data. Assume the following: - The data is in a **delimited text format** (for example, comma- or tab-separated). - There is sufficient **schema metadata** in the file (for example, a header row with column names and/or type hints) so that a general parser can be written without hardcoding a specific set of columns. - The **first field/column** of each data row is always intended to be a **unique numerical ID** for that employee. - Some columns may contain missing or malformed values. - Because the data is manually entered, there may be **formatting issues** and **duplicate records** (e.g., repeated IDs, partially duplicated rows, or inconsistent fields). - You may use only the Python standard library (no external libraries like Pandas). Your tasks: 1. **Parse the file** - Parse the contents of the given `StringIO` into an in-memory data structure of your choice (e.g., list of dicts, custom class, etc.). - Your parsing logic should: - Use the schema metadata provided in the file (e.g., a header row) so that it can be reused for similar files with different columns. - Be robust to common real-world data issues: extra whitespace, missing fields, inconsistent delimiters, empty lines, and duplicate IDs. - Apply reasonable validation and type conversion for fields (e.g., convert numeric fields to numbers, dates to `datetime` objects) based on available metadata or column names. 2. **Implement query functions** After parsing, implement functions to answer the following questions about the dataset: 1. **Count of higher salaries**: How many people in the dataset have salaries strictly above 30,000? - You may assume there is a column that clearly represents an employee's salary (e.g., named `salary`), but you must handle missing or invalid salary values carefully. 2. **Longest job tenure**: Which person has held their job the longest? - You may assume there is a column that represents the job start date (e.g., `start_date`), but it may be missing or malformed for some rows. - Decide and document how to handle rows with missing/invalid start dates. - Return some reasonable identifier for the employee (e.g., their unique ID or name) and handle ties in a clearly defined way. 3. **Second-highest salary**: Who has the **second-highest** salary? - Again, use the salary column and handle missing/invalid salary values. - Think about how to treat duplicate salaries (e.g., multiple people with the highest salary) and clearly define what your function should return. 3. **Generalization and design considerations** - Design your parsing and query code so that it can handle other payroll files that follow the same metadata convention but may have **different sets of columns**, additional fields, or different column ordering. - Keep the code **maintainable** and **testable**: - Separate parsing logic from business logic (i.e., the query functions). - Write code that would be straightforward to unit test (e.g., pure functions taking parsed data structures as input and returning results without side effects). Describe the approach and data structures you would use, and then implement the parsing and the three query functions according to the above requirements.

Quick Answer: This question evaluates proficiency in robust file parsing, schema-driven data modeling, type conversion, validation, duplicate handling, and implementing aggregate and ranking queries using Python's standard library for messy, delimited payroll data.

You are given a payroll "file" as raw text (the contents of a `StringIO`/file-like object). The text is manually-entered, delimited payroll data with a header row of column names; the first column of every data row is a unique numerical employee ID. The file may be comma- or tab-separated and contains real-world messiness: extra whitespace, empty lines, missing/malformed values, and duplicate IDs. Write a schema-driven parser (no hardcoded column set; read columns from the header) that builds an in-memory record per employee, then answer three queries. Return a result with these keys: 1. `count_above_30000` (int): how many employees have a `salary` strictly greater than 30000. Ignore rows whose salary is missing or non-numeric. 2. `longest_tenure_id` (int or None): the ID of the employee who has held their job the longest, i.e. the earliest valid `start_date`. Rows with a missing/malformed `start_date` are skipped. Break ties by smallest ID. If no row has a valid date, return None. 3. `second_highest_id` (int or None): the ID of an employee holding the second-highest DISTINCT salary. Duplicate top salaries collapse to one distinct value. Break ties (multiple holders of the second-highest salary) by smallest ID. If there are fewer than two distinct salaries, return None. Parsing rules: detect the delimiter (tab if the header contains a tab, else comma); strip whitespace from every cell; skip empty lines and rows with a non-integer first field; when an ID repeats, the last occurrence wins; tolerate `$` and thousands separators (`,`) inside salary cells; accept dates in `%Y-%m-%d`, `%m/%d/%Y`, `%d/%m/%Y`, or `%Y/%m/%d`. Use only the standard library.

Constraints

  • The file may be comma- or tab-separated; the delimiter is detected from the header line (tab if present, else comma).
  • The first column of every data row is a unique numerical employee ID; rows with a non-integer first field are skipped.
  • When an ID repeats, the last occurrence wins (deduplication).
  • Salary cells may be missing, non-numeric, or contain '$'/thousands separators; invalid salaries are ignored by the queries.
  • start_date cells may be missing or malformed; rows without a valid date are excluded from the tenure query. Accepted formats: %Y-%m-%d, %m/%d/%Y, %d/%m/%Y, %Y/%m/%d.
  • Only the Python standard library is used (csv, io, datetime).
  • Ties are broken by smallest employee ID for both the tenure and second-highest-salary queries.

Examples

Input: ("id,name,salary,start_date\n1,Alice,45000,2015-03-01\n2,Bob,28000,2018-07-15\n3,Carol,52000,2012-01-20\n4,Dave,30000,2020-05-05\n5,Eve,35000,2010-11-30",)

Expected Output: {'count_above_30000': 3, 'longest_tenure_id': 5, 'second_highest_id': 1}

Explanation: Salaries > 30000: Alice(45000), Carol(52000), Eve(35000) = 3 (Dave's 30000 is excluded, strictly greater). Earliest start_date is Eve 2010-11-30 -> id 5. Distinct salaries sorted desc: 52000, 45000, ... second-highest is 45000 held by Alice -> id 1.

Input: ("id,name,salary,start_date\n10,Frank,,2016-02-02\n11,Grace,abc,2014-08-08\n12,Heidi,31000,bad-date\n13,Ivan,90000,2009-09-09\n13,Ivan,91000,2009-09-09",)

Expected Output: {'count_above_30000': 2, 'longest_tenure_id': 13, 'second_highest_id': 12}

Explanation: Frank's salary is missing and Grace's is non-numeric, so both are ignored for salary queries. Duplicate id 13 -> last row wins (91000). Salaries > 30000: Heidi(31000) and Ivan(91000) = 2. Only valid dates belong to Frank(2016), Grace(2014) and the kept Ivan row(2009); earliest is Ivan 2009 -> id 13. Distinct salaries: 91000, 31000 -> second-highest 31000 is Heidi -> id 12.

Input: ("",)

Expected Output: {'count_above_30000': 0, 'longest_tenure_id': None, 'second_highest_id': None}

Explanation: Empty input: no header, no rows. All three queries return their empty defaults.

Input: ("emp_id\tfull_name\tsalary\tstart_date\n100\tJ\t30001\t2019-01-01\n\n101\tK\t29999\t2017-06-06\n102\tL\t30000\t2015-03-03",)

Expected Output: {'count_above_30000': 1, 'longest_tenure_id': 102, 'second_highest_id': 102}

Explanation: Tab-delimited with a blank line (skipped). Only J's 30001 is strictly above 30000 (29999 and 30000 are not) = 1. Earliest start_date is L 2015-03-03 -> id 102. Distinct salaries sorted desc: 30001, 30000, 29999 -> second-highest 30000 is held by L -> id 102.

Input: ("id,name,salary,start_date\n7,Solo,60000,2011-04-04",)

Expected Output: {'count_above_30000': 1, 'longest_tenure_id': 7, 'second_highest_id': None}

Explanation: Single employee: one salary above 30000, the only valid date belongs to id 7, and there is only one distinct salary so there is no second-highest -> None.

Hints

  1. Read the header row to learn the column names instead of hardcoding them; look up the 'salary' and 'start_date' columns by name so the same parser works for files with different columns.
  2. Store one record per employee ID in a dict so duplicate IDs naturally collapse to the last-seen row.
  3. For the second-highest salary, build the SET of distinct salaries, sort descending, take index 1, then return the smallest ID holding that salary. Returning the second element of a raw sorted list would mishandle duplicate top salaries.
  4. Validate every numeric/date conversion and skip (don't crash on) values you can't parse — that is the whole point of 'manually-entered' data.
Last updated: Jun 26, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement EMA Crossovers and Root Solvers - Squarepoint (medium)
  • Solve two coding assessment tasks - Squarepoint (hard)
  • Maximize revenue from inventory sales - Squarepoint (medium)
  • Maximize profit from one or many stock trades - Squarepoint (hard)