You implemented a synchronous API that lists all files under a directory (recursive crawl). Now you need to scale it.
Part A — Scaling discussion (no code required)
How would you scale the “list all files under a path” API when:
- The directory tree can be huge (millions or billions of files)
- Crawls can take minutes or longer and may time out
- Many users may request crawls concurrently
- The underlying storage could be local disks or network storage (e.g., NFS or an object-store-like abstraction)
Discuss:
- API shape (sync vs. async)
- Pagination/streaming of results
- Concurrency limits and backpressure (see the sketch after this list)
- Caching and deduplication
- Failure handling and partial results
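For the concurrency point, one concrete pattern is a bounded worker pool draining a shared frontier queue. Below is a minimal Node/TypeScript sketch, assuming the standard `node:fs/promises` API; the function name `crawl`, the fixed `limit`, and the skip-and-continue error policy are illustrative assumptions, not a prescribed design:

```ts
import { readdir } from "node:fs/promises";
import { join } from "node:path";

// Walk a tree with at most `limit` directory reads in flight.
// The queue is the backpressure point: workers pull from it instead of
// fanning out unboundedly, so in-flight work stays bounded.
async function crawl(
  root: string,
  limit: number,
  onFile: (path: string) => void,
): Promise<void> {
  const queue: string[] = [root];
  let active = 0;

  return new Promise((resolve) => {
    const pump = () => {
      // Done only when no work is queued and no reads are in flight.
      if (queue.length === 0 && active === 0) return resolve();
      while (active < limit && queue.length > 0) {
        const dir = queue.shift()!;
        active++;
        readdir(dir, { withFileTypes: true })
          .then((entries) => {
            for (const e of entries) {
              const p = join(dir, e.name);
              if (e.isDirectory()) queue.push(p); // note: follows the tree only, no symlink handling
              else onFile(p);
            }
          })
          .catch(() => {
            // Illustrative policy: record the failed subtree elsewhere and
            // continue, so the crawl can still return partial results.
          })
          .finally(() => {
            active--;
            pump();
          });
      }
    };
    pump();
  });
}
```

The queue length is also a natural backpressure signal: if it grows past a threshold, the job can pause dequeuing or spill the frontier to disk.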
Part B — Async job implementation (with provided templates)
Based on your design in Part A, implement (or describe at a detailed interface level) an async crawl job:
- Client submits a `rootPath` and gets back a `jobId`.
- Clients can poll job status and fetch results.
- The job should be robust to failures, support retries, and avoid repeating work excessively.
Specify key components/APIs, e.g.:
- `POST /crawl-jobs` → `{ jobId }`
- `GET /crawl-jobs/{jobId}` → status/progress
- `GET /crawl-jobs/{jobId}/results?cursor=...` → paged results
State any assumptions (e.g., maximum result size, retention window, storage type).
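To make the interface concrete, here is a minimal in-memory sketch of the three endpoints' behavior (TypeScript; the page size, the offset-style cursor, and the job store are assumptions for illustration, and it reuses the `crawl` helper sketched in Part A). A production version would persist jobs and results durably, enforce a retention window, and make the cursor an opaque token:

```ts
import { randomUUID } from "node:crypto";

type JobStatus = "pending" | "running" | "succeeded" | "failed";

interface Job {
  id: string;
  rootPath: string;
  status: JobStatus;
  results: string[]; // discovered file paths
  error?: string;
}

const jobs = new Map<string, Job>();
const PAGE_SIZE = 1000; // assumed maximum page size

// POST /crawl-jobs → { jobId }
function submitJob(rootPath: string): { jobId: string } {
  const id = randomUUID();
  const job: Job = { id, rootPath, status: "pending", results: [] };
  jobs.set(id, job);
  void runJob(job); // fire and forget; clients poll for completion
  return { jobId: id };
}

// GET /crawl-jobs/{jobId} → status/progress
function getStatus(jobId: string): { status: JobStatus; found: number } | undefined {
  const job = jobs.get(jobId);
  return job && { status: job.status, found: job.results.length };
}

// GET /crawl-jobs/{jobId}/results?cursor=... → paged results
// The cursor is a plain offset here for brevity.
function getResults(jobId: string, cursor = 0) {
  const job = jobs.get(jobId);
  if (!job) return undefined;
  const page = job.results.slice(cursor, cursor + PAGE_SIZE);
  const next = cursor + page.length;
  return { files: page, nextCursor: next < job.results.length ? next : null };
}

async function runJob(job: Job): Promise<void> {
  job.status = "running";
  try {
    // `crawl` is the bounded-concurrency walker from the Part A sketch.
    await crawl(job.rootPath, 16, (p) => job.results.push(p));
    job.status = "succeeded";
  } catch (err) {
    job.status = "failed";
    job.error = String(err);
  }
}
```

Because results are appended as the crawl runs, clients can page through them before the job finishes, which addresses the long-running-crawl and partial-results requirements in one mechanism.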