Coding Fundamentals For Data Scientists

What's being tested

Meta Data Scientist coding interviews test whether you can turn ambiguous data-shaped problems into correct, efficient, readable code under time pressure. The interviewer is usually probing for fluency with Python fundamentals, data structures, algorithmic complexity, and edge-case handling—not obscure competitive-programming tricks. For DS roles, they also care that your code resembles production analysis code: clear variable names, robust handling of missing or malformed inputs, and solutions that scale from toy examples to millions of rows. Strong candidates explain tradeoffs as they code, because at Meta the same logic may later be translated into pandas, SQL, Spark, or a production metric pipeline.

Core knowledge

Know the core Python containers and their complexity. list indexing and append are $O(1)$ amortized, membership is $O(n)$ ; dict and set lookup/insert/delete are $O(1)$ average but $O(n)$ worst-case under hash collisions; deque.popleft() is $O(1)$ unlike list.pop(0).
Use hash maps for counting, grouping, and deduplication. collections.Counter, defaultdict(int), and defaultdict(list) solve many DS-style questions cleanly. For $N \leq 10^7$ simple Python counting is often fine; beyond that, consider streaming sketches, external sort, SQL aggregation, or Spark.
Sorting is often the simplest acceptable baseline. Python’s Timsort is stable and $O(n \log n)$ worst-case, but close to $O(n)$ on partially sorted data. Sorting is useful for interval merging, rank computation, deduplication, two-pointer scans, and deterministic tie-breaking.
For top- $k$ problems, compare full sort vs heap vs selection. Sorting counts costs $O(m \log m)$ for $m$ unique keys; a min-heap of size $k$ costs $O(m \log k)$ ; Quickselect averages $O(m)$ but is harder to implement correctly and has poor worst-case without randomized pivots.
Be fluent with two-pointer and sliding-window patterns. They work when the problem has monotonic movement: sorted arrays, substring windows, cumulative constraints, or pair sums. Typical complexity is $O(n)$ after sorting, with edge cases around empty inputs, duplicates, negative values, and window shrink conditions.
Separate correctness from performance. First state a brute-force solution, such as checking all pairs in $O(n^2)$ , then optimize using a set, sorting, prefix sums, or heap. Interviewers reward the ability to articulate why an optimization is valid, not just jumping to code.
Prefix sums make range queries and cumulative metrics cheap. Define $P[0]=0$ , $P[i+1]=P[i]+x_i$ , so sum(l, r) for half-open interval $[l,r)$ is $P[r]-P[l]$ . This is common for rolling metrics, retention windows, and cumulative event counts.
Be careful with floating-point arithmetic. Python float is IEEE 754 double precision; comparisons like a == b can fail after repeated operations. Use tolerances such as abs(a-b) < 1e-9, or decimal.Decimal for currency-like calculations where exact base-10 representation matters.
Treat missingness and malformed data as first-class cases. Decide how to handle None, empty strings, negative timestamps, duplicate user IDs, unsorted records, and timezone boundaries. In DS work, “bad rows” are common; robust code either filters, imputes, raises, or logs explicitly.
Write test cases before or during coding. Include nominal cases, empty input, single element, all duplicates, no match, ties, large values, and boundary sizes. A strong Meta answer often includes two or three quick examples run mentally after the implementation.
Prefer readable Python over clever one-liners. Counter(events).most_common(k) may be acceptable if the point is counting, but be ready to implement the underlying logic with a dictionary and heap. Explain library behavior if you use it, especially tie-breaking and complexity.
Recognize when pandas is appropriate but not sufficient. groupby, merge, sort_values, and vectorized NumPy operations are excellent for analysis, but interviews often expect plain Python. If using pandas conceptually, understand the equivalent hash join, aggregation, or sort-based operation underneath.

Worked example

For Top K Frequent Elements, a strong candidate would start by clarifying the input and output contract: “Are elements strings or integers, can the input be empty, should ties be ordered deterministically, and is $k$ always valid or can it exceed the number of unique elements?” Then they would declare a reasonable assumption, such as returning any order unless a tie-break rule is specified, and treating $k >$ unique count as returning all unique elements. The answer skeleton should have four pillars: count frequencies with a hash map, choose the top- $k$ selection strategy, analyze complexity, and test edge cases. A simple implementation uses Counter(nums) followed by heapq.nlargest(k, counts.items(), key=lambda x: x[1]), which is $O(n + m \log k)$ for $n$ total elements and $m$ unique elements. If the interviewer disallows libraries, the candidate can manually build the dictionary and maintain a min-heap of (frequency, value) pairs. The main tradeoff to flag is that full sorting is shorter and more deterministic at $O(m \log m)$ , while a heap is better when $k \ll m$ , such as top 100 items among 10 million unique IDs. The candidate should explicitly handle k == 0, empty input, and ties, because product metrics often require stable reproducibility across runs. They would close by saying, “If I had more time, I’d add deterministic tie-breaking, memory profiling for very high cardinality keys, and possibly a streaming approximation like Count-Min Sketch if the data cannot fit in memory.”

A second angle

For Merge Overlapping Intervals, the same fundamentals appear, but the core structure shifts from hashing to sorting and invariant maintenance. A good candidate clarifies whether intervals are closed [start, end] or half-open [start, end), whether touching intervals like [1,3] and [3,5] should merge, and whether input intervals can be invalid or unsorted. The typical solution sorts by start time, then scans once while maintaining the current merged interval, producing $O(n \log n)$ time and $O(n)$ output space. The important invariant is: after processing the first $i$ intervals, the result list contains non-overlapping intervals sorted by start, and only the last interval can possibly overlap the next one. Compared with top- $k$ , the candidate’s communication should focus less on data structure choice and more on boundary semantics, because a one-character decision like <= versus < changes the answer.

Common pitfalls

Analytical mistake: Jumping to the first working solution without complexity awareness. For example, checking every element against every other element for duplicates is tempting and simple, but $O(n^2)$ becomes unusable around hundreds of thousands of rows. A stronger answer says, “The brute force is quadratic; I’ll use a set to reduce membership checks to expected $O(1)$ .”

Communication mistake: Coding silently and leaving the interviewer to infer assumptions. Saying nothing about ties in a top- $k$ problem, or about inclusive boundaries in an interval problem, makes even correct code look fragile. State assumptions early, then code to those assumptions: “I’ll treat intervals as closed and merge if next_start <= current_end.”

Depth mistake: Overusing built-ins without understanding them. Counter(...).most_common(k) may pass a small test, but if asked about scalability, ties, or streaming data, shallow answers fall apart. It is better to say, “I’ll use Counter for brevity, but underneath this is a hash map; for large $m$ and small $k$ , I’d pair it with a heap.”

Connections

Interviewers may pivot from coding fundamentals into SQL aggregation, pandas data manipulation, experimentation metrics, or production data pipeline design. If your solution involves counting, ranking, or joining, expect follow-ups about high-cardinality data, approximate algorithms, distributed execution in Spark/Presto, or metric correctness under duplicated events.