Adobe Document Cloud Search Indexing And Autocomplete
Asked of: Software Engineer
Last updated

What's being tested
Candidates must demonstrate end-to-end systems design and implementation judgment for a large-scale document search and autocomplete system: how to build a robust indexing pipeline, choose index data structures, meet latency/throughput SLAs, and handle updates/deletes and failure modes. Interviewers probe the candidate's ability to trade off real-time indexing vs. cost, pick appropriate search structures (memory vs disk), reason about sharding/replication, and design low-latency autocomplete that tolerates typos and scale.
Core knowledge
-
Inverted index: core structure mapping terms → posting lists (docID, positions, term-frequency). Use compressed posting formats (
Gamma/VarInt,PForDelta) for disk efficiency and skip lists for fast seeks. -
Tokenization & analyzers: tokenizers, stopword removal, stemming (
Porter), and n-gram/edge n-gram analyzers shape what queries match; mismatched analyzer between index/query causes silent failures. -
Finite-state transducer (FST) and trie: FSTs (e.g.,
LuceneFST) are compact, memory-efficient structures for prefix autocomplete; store outputs for frequencies or docIDs for fast prefix lookup. -
Fuzzy matching: Levenshtein automata or
BK-treesallow typo tolerance; Levenshtein-based automata integrated inLuceneprovide bounded-edit-distance queries at predictable cost. -
Near-real-time (NRT) indexing: implement with a fast in-memory segment + periodic refresh to make docs searchable; tune refresh interval to trade indexing latency vs query throughput and merge overhead.
-
Sharding & replication: shard by docID or tenant to scale writes/reads; replicas provide high availability and read scaling; plan shard count for expected growth — re-sharding is expensive.
-
Index merging & compaction: merges reduce segment count but are I/O heavy; control merge policies to avoid write stalls; use background merge threads and backpressure on ingestion.
-
Bulk ingestion & idempotency: batch documents and use idempotency keys to make retries safe; stream input via
Kafkaor change-data-capture, store raw payloads inS3for reprocessing. -
Update/delete semantics: updates often implemented as delete + reindex with versioning/tombstones; support versioned writes and periodic GC of tombstones to reclaim space.
-
Autocomplete ranking signals: precompute popularity counts, combine prefix matches with recency, personalization proxies, and context filters; keep suggestion data in an in-memory store (hot) and fallback to index (cold).
-
Capacity planning & SLAs: for target
QPSandp99latency, budget CPU, memory (FSTs often memory-resident), disk IOPS, and network; example: 10kQPSwithp99< 50ms may need several dozen query-serving nodes depending on index size. -
Monitoring & correctness: monitor
indexing-lag,refresh-time,merge-time,heap-pct, and search error rates; build diff tests to detect tokenization mismatches or lost updates.
Worked example — "Design a scalable indexing pipeline for Document Cloud that supports near-real-time search and updates"
First 30s: clarify SLAs (indexing latency target — e.g., <5s; query p99 — e.g., <100ms), scale (docs/day, average doc size), multi-tenancy and ACL requirements. Skeleton answer pillars: (1) ingestion layer: Kafka topics with producers that emit raw payloads and idempotency keys, (2) parsing layer: scalable parser workers (PDF → text, OCR fallback) that emit normalized tokens and metadata, (3) indexing layer: write to distributed Lucene-based clusters with sharding and replicas, using micro-batches and a tuned refresh interval for NRT, (4) maintenance: merges, tombstone GC, and reindex jobs. One tradeoff to flag: decreasing refresh interval lowers visibility latency but increases segment churn and CPU; choose micro-batch size and refresh frequency per tenant SLAs. Close by noting short-term priorities (implement idempotent bulk API, monitor indexing lag) and long-term work (cold storage reindexing, cross-cluster replication) if more time were available.
A second angle — "Design an autocomplete service for document search supporting typo-tolerance and personalization"
Autocomplete shifts priorities: ultra-low latency (single-digit ms) and memory-efficient data structures. Build a hot path with an in-memory FST per shard for edge n-grams storing suggestion scores, refreshed asynchronously from the main index every few seconds. For typo-tolerance, precompute popular fuzzy variants or use Levenshtein automata limited to edit distance 1 for top prefixes; heavy fuzzy queries must be rate-limited or routed to a more expensive backend. Personalization applies signals (user history, document access) to re-rank suggestions at query time; compute lightweight features in-memory and apply reranking within the latency budget. The framing differences: autocomplete favors precomputation and memory footprint; full-text search tolerates higher per-query cost and greater disk I/O.
Common pitfalls
Pitfall: Assuming index and query analyzers are interchangeable.
If you don't explicitly align tokenization between index-time and query-time, results will be inconsistent (e.g., n-gram at index but whitespace on query). Always document analyzer contracts and include round-trip tests.
Pitfall: Designing for average latency only.
Focusing on mean latency hides tail problems; interviewers expectp95/p99reasoning and how you'll prevent head-of-line blocking during merges or GC pauses (e.g., circuit-breakers, backpressure).
Pitfall: Overfocusing on ML ranking for autocomplete.
Spending the interview on model architecture instead of core infra (FST, refresh cadence, memory budget) misses the SWE scope; state the ML signals and where they plug in, but prioritize operational constraints.
Connections
Interviewers may pivot to adjacent topics like distributed consensus for cross-region index replication, caching strategies (cold vs hot shards, CDN for big-result sets), or observability (tracing, request sampling) to validate tail latency and correctness.
Further reading
-
Lucene in Action / Lucene documentation — concrete implementations of inverted indexes, FSTs, and analyzers.
-
[Designing Data-Intensive Applications — Martin Kleppmann] — staple for reasoning about replication, partitions, and stream processing tradeoffs.
Related concepts
- Adobe Search Indexing And Autocomplete
- Adobe Asset Search Indexing And Autocomplete
- Adobe Creative Cloud asset search, indexing, autocomplete, and sharding
- Adobe Document Cloud real-time collaboration and offline sync
- Search, Autocomplete And Restaurant DiscoverySystem Design
- Adobe Creative Cloud Asset Sync And Conflict Resolution