Optimize with a cache/hash map
Company: Amazon
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
##### Question
This is a follow-up to your existing implementation (for example, a script that repeatedly resolves file paths or parses license strings into computed results). Propose and analyze an optimization that adds an in-memory cache backed by a dictionary/hash map to speed up repeated lookups. Address each part:
1. **Where to cache.** Identify the spot in your code where caching or a hash map would actually help — i.e. a pure, repeated, deterministic computation whose inputs recur. Justify why that call is a good candidate and why other calls are not.
2. **Keys and values.** Specify the cache key (what uniquely identifies a computation, e.g. the file path or the normalized license string) and the cached value (the computed result). Note any normalization needed so logically-equal inputs map to the same key.
3. **Expected hit rate.** Estimate the expected cache hit rate given the access pattern, and explain what assumptions drive it (input cardinality vs. number of lookups, temporal locality).
4. **Eviction policy.** Choose and justify an eviction policy (LRU / LFU / TTL, or a bounded `dict` with no eviction) based on working-set size and memory budget.
5. **Invalidation / staleness.** Describe how you keep cached values correct when the underlying source changes — e.g. invalidating an entry when a file's `mtime` or content hash changes, or expiring via TTL — and the consistency vs. staleness trade-off you are accepting.
6. **Concurrency safety.** Discuss what happens if multiple threads or processes run concurrently: races on get/put, the thundering-herd / cache-stampede problem on a miss, and how you'd make it safe (locking, per-key locks, or a process-local cache).
7. **Pseudocode.** Provide `get`/`put` pseudocode that includes the validation/invalidation check on read.
8. **Complexity.** State the time and space complexity of the relevant operation before and after adding the cache.
Quick Answer: Amazon software-engineer technical-screen coding follow-up: take an existing script (e.g. one that resolves file paths or parses license strings) and design a hash-map-backed in-memory cache to speed up repeated lookups. It tests choosing a cacheable computation, defining normalized keys and values, estimating hit rate, picking an eviction policy (LRU/LFU/TTL), invalidating on file mtime/content-hash changes, ensuring concurrency safety, and analyzing time and space complexity before and after caching.
Solution
**1. Where to cache.** A cache only helps when a computation is (a) repeated with recurring inputs, (b) deterministic (same input → same output), and (c) more expensive than the lookup itself. In a script that resolves file paths or parses/validates license strings, the parse-and-compute step is the prime candidate: the same path or license string is often processed many times, and parsing is pure given the input. Calls that are cheap, side-effecting, or have unique inputs every time (e.g. timestamps, random IDs) are poor candidates — caching them just wastes memory and adds staleness risk.
**2. Keys and values.**
- *File-path case:* key = the canonical/absolute path (resolve symlinks and `..`, normalize separators) so two spellings of the same file collapse to one entry; value = the computed result. If the result depends on file contents, fold the file's content hash or `(mtime, size)` into the key or store it alongside the value for validation.
- *License-string case:* key = the normalized license string (trim, lowercase or canonicalize, strip insignificant whitespace) so logically-equal strings hit the same entry; value = the parsed/validated result object.
Normalization matters: if you key on the raw string, `"MIT"` and `" mit "` miss each other and tank your hit rate.
**3. Expected hit rate.** Hit rate ≈ 1 − (distinct inputs / total lookups), adjusted for working-set fit and temporal locality. If you do N lookups over D distinct inputs and the cache holds the whole working set, steady-state hit rate ≈ (N − D) / N. So 10,000 lookups over 200 distinct paths → ~98% hits once warm. Heavy skew (a few hot keys dominate) pushes it higher; near-unique inputs push it toward 0% and mean you shouldn't cache.
**4. Eviction policy.**
- If the working set is small and bounded (e.g. a fixed set of files), an unbounded `dict` is simplest and correct.
- If memory is constrained and access has recency locality, **LRU** (e.g. `functools.lru_cache(maxsize=...)` or an `OrderedDict`) is the default good choice.
- If some keys are persistently hot regardless of recency, **LFU** can beat LRU.
- If correctness is time-bounded rather than memory-bounded (data goes stale after a while), use **TTL** expiry, optionally combined with LRU for the size bound.
Pick based on working-set size vs. memory budget; state the `maxsize` you'd choose and why.
**5. Invalidation / staleness.** Caching a derived value is only safe while the source is unchanged.
- *Files:* store the `(mtime, size)` or content hash captured at fill time; on every `get`, re-stat the file and treat the entry as a miss if it changed. `mtime`/size is cheap but can miss same-second edits; a content hash is exact but costs a read. Choose per the correctness requirement.
- *Time-bounded data:* a TTL bounds staleness to the TTL window without touching the source on each read.
The trade-off is explicit: validating on every read (re-stat/re-hash) maximizes freshness at the cost of an I/O per hit; TTL/no-revalidation maximizes speed at the cost of serving stale results for a bounded window. Name which side you're choosing and why.
**6. Concurrency safety.**
- *Single process, multiple threads:* a plain `dict` get/put isn't atomic across the check-then-compute-then-store sequence, so two threads can both miss and both compute. Guard with a lock; for throughput use per-key locks so unrelated keys don't serialize. This also addresses **cache stampede** (thundering herd): on a cold miss for a hot key, have the first thread compute while others wait on that key's lock and then read the filled value.
- *Multiple processes:* a process-local `dict` isn't shared — each process warms its own cache (fine, just lower aggregate hit rate), or you move to a shared store (e.g. Redis/memcached) with atomic ops and the same invalidation rules. Note that a process-local cache also sidesteps cross-process consistency entirely.
**7. Pseudocode (get/put with validation):**
```python
from collections import OrderedDict
import os, threading
class FileResultCache:
def __init__(self, maxsize=1024):
self._d = OrderedDict() # key -> (stamp, value)
self._maxsize = maxsize
self._lock = threading.Lock()
def _current_stamp(self, path):
st = os.stat(path)
return (st.st_mtime_ns, st.st_size) # or a content hash
def get(self, path):
key = os.path.realpath(path) # normalize
stamp = self._current_stamp(key)
with self._lock:
entry = self._d.get(key)
if entry is not None and entry[0] == stamp:
self._d.move_to_end(key) # LRU touch
return entry[1] # valid hit
# miss or stale -> fall through to compute
value = compute_result(key) # expensive work, outside lock
self.put(key, stamp, value)
return value
def put(self, key, stamp, value):
with self._lock:
self._d[key] = (stamp, value)
self._d.move_to_end(key)
if len(self._d) > self._maxsize:
self._d.popitem(last=False) # evict LRU
```
(For the license-string variant, replace path normalization with string normalization and drop the `stat`-based validation in favor of a TTL or no invalidation, since a string key is self-describing.)
**8. Complexity.**
- *Per-lookup, before caching:* O(cost of the computation) every time — e.g. O(F) to read+parse a file of size F, or O(L) to parse a license string of length L.
- *Per-lookup, after caching:* O(1) average for a hash-map hit (plus an O(1) `stat`/validation if you revalidate), and O(cost of the computation) only on a miss. Amortized over a run with hit rate h, expected cost per lookup ≈ (1 − h)·O(compute) + O(1).
- *Space:* O(number of cached entries × value size), bounded by `maxsize` under LRU/LFU. A plain hash map without eviction is O(distinct inputs) and can grow unbounded — the reason to bound it.
The net win is trading O(distinct-inputs) memory for turning repeated O(compute) work into O(1) hits; it only pays off when the hit rate is high and the cached values stay valid long enough to be reused.
Explanation
The question is a systems-flavored follow-up to a coding task: take an existing script and reason about adding a hash-map-backed cache. A strong answer covers all parts — picks a genuinely cacheable (pure, repeated, deterministic) call; defines normalized keys and the cached value; estimates hit rate from input cardinality vs. lookups; justifies an eviction policy (LRU/LFU/TTL/unbounded) by working-set and memory; handles invalidation via mtime/content-hash or TTL with an explicit staleness trade-off; addresses concurrency (lock or per-key lock, cache-stampede) for threads and process-local vs. shared cache for processes; and states before/after time and space complexity. The key insight is that caching is only correct and worthwhile when the computation is pure and repeated and the cached value can be kept valid.