You are given a payroll "file" as a StringIO (file-like) object in Python. The underlying text represents manually-entered payroll data.
Assume the following:
-
The data is in a
delimited text format
(for example, comma- or tab-separated).
-
There is sufficient
schema metadata
in the file (for example, a header row with column names and/or type hints) so that a general parser can be written without hardcoding a specific set of columns.
-
The
first field/column
of each data row is always intended to be a
unique numerical ID
for that employee.
-
Some columns may contain missing or malformed values.
-
Because the data is manually entered, there may be
formatting issues
and
duplicate records
(e.g., repeated IDs, partially duplicated rows, or inconsistent fields).
-
You may use only the Python standard library (no external libraries like Pandas).
Your tasks:
-
Parse the file
-
Parse the contents of the given
StringIO
into an in-memory data structure of your choice (e.g., list of dicts, custom class, etc.).
-
Your parsing logic should:
-
Use the schema metadata provided in the file (e.g., a header row) so that it can be reused for similar files with different columns.
-
Be robust to common real-world data issues: extra whitespace, missing fields, inconsistent delimiters, empty lines, and duplicate IDs.
-
Apply reasonable validation and type conversion for fields (e.g., convert numeric fields to numbers, dates to
datetime
objects) based on available metadata or column names.
-
Implement query functions
After parsing, implement functions to answer the following questions about the dataset:
-
Count of higher salaries
: How many people in the dataset have salaries strictly above 30,000?
-
You may assume there is a column that clearly represents an employee's salary (e.g., named
salary
), but you must handle missing or invalid salary values carefully.
-
Longest job tenure
: Which person has held their job the longest?
-
You may assume there is a column that represents the job start date (e.g.,
start_date
), but it may be missing or malformed for some rows.
-
Decide and document how to handle rows with missing/invalid start dates.
-
Return some reasonable identifier for the employee (e.g., their unique ID or name) and handle ties in a clearly defined way.
-
Second-highest salary
: Who has the
second-highest
salary?
-
Again, use the salary column and handle missing/invalid salary values.
-
Think about how to treat duplicate salaries (e.g., multiple people with the highest salary) and clearly define what your function should return.
-
Generalization and design considerations
-
Design your parsing and query code so that it can handle other payroll files that follow the same metadata convention but may have
different sets of columns
, additional fields, or different column ordering.
-
Keep the code
maintainable
and
testable
:
-
Separate parsing logic from business logic (i.e., the query functions).
-
Write code that would be straightforward to unit test (e.g., pure functions taking parsed data structures as input and returning results without side effects).
Describe the approach and data structures you would use, and then implement the parsing and the three query functions according to the above requirements.