Streaming, Large Inputs, And External Memory

What's being tested

Interviewers are probing whether you can design memory-bounded algorithms when the input is too large to fit in RAM, a common reality in search, logs, indexing, storage systems, and distributed services. Strong answers distinguish between streaming, batch chunking, external-memory, and distributed approaches instead of defaulting to “load everything into a hash map.” Google cares because many SWE problems are constrained by I/O, latency, fault tolerance, and correctness under scale, not just asymptotic CPU time. The interviewer is looking for clear assumptions, correct complexity analysis, and the ability to trade accuracy, memory, disk, and parallelism deliberately.

Core knowledge

Streaming algorithms process each record once or a small number of times, usually with $O(1)$ , $O(\log n)$ , or bounded $O(k)$ memory. They are ideal for max/min, counts, checksums, online averages, reservoir sampling, and approximate frequency or quantile summaries.
Chunked processing is the default upgrade when input exceeds memory but can be scanned sequentially. Read fixed-size buffers, parse records across chunk boundaries carefully, compute partial results, then merge. For max value, each worker can compute a local max and reduce to a global max.
External sorting handles problems that need global ordering but exceed RAM. Sort chunks in memory, spill sorted runs to disk, then perform a k-way merge using a min-heap of size $k$ . Runtime is dominated by sequential disk I/O, not CPU comparison cost.
Hash partitioning is a powerful pattern for set intersection across huge files. Partition both files by hash(object_id) % P, write corresponding partitions to disk, then process each pair independently in memory. Correctness depends on using the same deterministic hash and preserving full records or IDs.
Bloom filters provide approximate set membership with false positives but no false negatives. For $m$ bits, $n$ inserted items, and $k$ hash functions, false-positive probability is approximately $(1 - e^{-kn/m})^k$ . They are useful to prefilter candidates before exact verification.
Two-heaps median maintains an exact online median when all values fit conceptually in memory: max-heap for lower half, min-heap for upper half, rebalance so sizes differ by at most one. Updates cost $O(\log n)$ and space is $O(n)$ , which fails for unbounded streams.
Approximate quantile summaries solve memory-constrained median at scale. Algorithms like Greenwald-Khanna, t-digest, and KLL sketches trade accuracy for bounded memory. A typical guarantee is rank error: returned value has rank within $\epsilon n$ of the true median.
Reservoir sampling selects a uniform sample of size $k$ from an unknown-length stream using $O(k)$ memory. The $i$ th item replaces a random existing sample with probability $k/i$ . It can estimate medians but gives probabilistic error rather than deterministic rank bounds.
I/O complexity matters for huge files. A theoretically $O(n)$ algorithm with random disk seeks can be slower than an $O(n \log n)$ algorithm using sequential reads. Prefer sequential scans, buffered reads, compression-aware parsing, and minimizing intermediate materialization.
Parallel reduction works when the operation is associative and commutative, such as max, min, sum, count, bitwise OR, or set union. Median is not directly reducible from local medians; you need mergeable summaries, histograms, selection algorithms, or distributed quantile sketches.
Record boundary handling is a frequent edge case. Huge text files may split a number, JSON object, or log line across buffers. A robust solution keeps a carryover buffer, validates malformed records, defines overflow behavior, and handles empty files explicitly.
Exact versus approximate must be stated early. Exact shared-object detection may require disk partitioning or sorting; approximate detection can use Bloom filters. Exact median may require multiple passes or external storage; approximate median can be one-pass with bounded memory.

Worked example

For Find maximum value in huge text file, a strong candidate first clarifies the format: one integer per line or arbitrary delimiters, signed values, possible malformed records, file size, local disk versus distributed storage, and whether the answer must be exact. They would then declare the key assumption: the file is far larger than memory, but a sequential scan is possible. The answer skeleton is: stream the file in large buffers, parse numbers safely across chunk boundaries, maintain a running maximum, and optionally parallelize by splitting the file into byte ranges aligned to record boundaries. If using multiple workers, each worker computes a local maximum, and a final reducer computes the global maximum, which is valid because max is associative and commutative. Complexity is $O(n)$ time over bytes or records, $O(1)$ memory per worker excluding the input buffer, and one sequential pass over the file. A good tradeoff to flag is buffer size: larger buffers reduce syscall overhead but increase memory footprint and may complicate latency or worker scheduling. They should also mention edge cases: empty file, all negative numbers, integer overflow, partial final line, corrupted input, and encoding. A strong close would be: “If I had more time, I’d add checksums or byte-offset logging for restartability, benchmark buffer sizes, and compare single-machine scan versus distributed scan based on storage bandwidth.”

A second angle

For Find shared objects across two log files, the same large-input principles apply, but a single running variable is not enough because the problem is about set membership. If one file fits in memory, build a hash set of its object IDs and stream the other file, outputting matches in $O(n + m)$ time. If neither fits, use hash partitioning: split both files into the same $P$ partitions by object ID, then load and intersect one partition pair at a time. If duplicate IDs matter, clarify whether the output is unique shared objects or multiplicities; that changes the data structure from a set to counts. If approximation is allowed, build a Bloom filter from one file and stream the other, then optionally verify candidates to remove false positives.

Common pitfalls

Pitfall: Treating “huge input” as only a Big-O problem.

Saying “use a hash map, it’s $O(n)$ ” misses the central constraint: memory. A better answer explicitly calculates whether the structure fits, then proposes streaming, partitioning, external sorting, or approximate sketches when it does not.

Pitfall: Ignoring exactness requirements.

For median or shared IDs, approximate methods like sampling, Bloom filters, and sketches may be excellent, but only if false positives or rank error are acceptable. A strong candidate states the guarantee: exact output, no false negatives, bounded rank error, or probabilistic confidence.

Pitfall: Skipping parsing and boundary correctness.

Many candidates describe the high-level algorithm but forget that chunks can split records. Interviewers often push on this because production failures come from partial lines, malformed records, overflow, duplicate handling, and inconsistent delimiters.

Connections

This topic often pivots into distributed systems, especially map-reduce style aggregation, sharding, retries, and idempotent partial results. It also connects to probabilistic data structures such as Bloom filters, Count-Min Sketch, HyperLogLog, and quantile sketches, plus classic systems performance topics like buffering, disk seeks, compression, and backpressure.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts