Coding, Data Structures, And Parsing
Asked of: Data Scientist
Last updated

-
What's being tested
Candidates must show they can parse messy input reliably and choose data structures and algorithms that meet correctness, time, and memory constraints at scale. Interviewers look for clear problem framing, edge-case handling, streaming vs batch trade-offs, and when to use approximations. -
Core knowledge
- Time/space complexity basics: O(n), O(n log n), and when external sort needed.
- Hash maps/sets for counting and deduplication; heaps for top-k queries.
- Streaming algorithms: Count-Min Sketch for frequencies, HyperLogLog for cardinality.
- CSV/JSON pitfalls: quoted fields, newlines inside fields, and newline-delimited JSON (NDJSON).
- Regex vs stateful parsers: regex brittle for nested/quoted formats; use proper parsers.
- Sliding-window techniques for time-series grouping and fixed-memory windows.
- Two-pass and external algorithms: map-reduce, sort-merge, and reservoir sampling for uniform subsamples.
-
Worked example
Framing: "Top-k active users per minute from newline-delimited logs" — first ask input size, memory limits, timestamp format, and whether approximate results are allowed. Decide streaming processing if input > memory: parse each line to extract user ID and minute bucket, maintain hashmap minute→(user→count) and a min-heap per minute for top-k, or use Count-Min plus a heap for heavy hitters to reduce memory. Discuss time-window retention, handling late or out-of-order timestamps, and fall back to external sort or map-reduce for exact global aggregation if streaming state is too large. -
A common pitfall
A tempting but wrong approach is to read all data into in-memory arrays and then parse/sort blindly; this misses scale and edge cases like quoted CSV fields or malformed JSON. Another common mistake is using regex to parse complex nested formats, which fails on escaped characters and embedded delimiters. Also watch for not asking about input size or error handling — those constraints change the whole design. -
Further reading
- Martin Kleppmann, Designing Data-Intensive Applications (practical systems-level trade-offs).
- Flajolet et al., HyperLogLog paper (cardinality estimation at scale).
Related concepts
- Coding Fundamentals For Data Scientists
- Core Data Structures, Sorting, And ComplexityCoding & Algorithms
- Arrays, Strings, Hash Maps, And Frequency CountingCoding & Algorithms
- Coding Fundamentals and Complexity for DS
- Core Data Structures, Algorithms, And ComplexityCoding & Algorithms
- String Parsing, Palindromes, And NormalizationCoding & Algorithms