Design a scalable resume search system
Company: MongoDB
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
Design a resume search platform for a recruiting company with these requirements:
(1) Applicants upload resumes (PDF/Word).
(2) A Resume Server parses resumes into structured metadata, generates thumbnails/resized previews, and stores both metadata and files.
(3) Recruiters search and view matching resumes.

Specify:
• High-level architecture and components (applicant/recruiter interfaces, resume server services, data stores).
• Data model for resume metadata and file references; discuss schema fields (name, contact, skills, education, experience, locations, uploaded_at) and versioning.
• Storage choices for blobs versus metadata (e.g., object storage + document/relational DB) and trade-offs (consistency, transactions, cost).
• Asynchronous processing pipeline for parse/resize using a queue and horizontally scalable workers; address idempotency and retries.
• Public APIs for upload, search with multiple filters (skills, location, experience, education), and get-by-id; outline request/response shapes and pagination.
• Search strategy and indexing with a dedicated search engine; propose mappings/fields, ranking signals, and example queries (term, text, range filters).
• Scalability plan: sharding/partitioning, back-pressure, rate limiting, caching/CDN for previews, and data lifecycle policies.
• Security and privacy: HTTPS, authentication, RBAC for applicants vs recruiters, signed URLs for file access, PII handling, audit logs.
• Observability and reliability: metrics, logs, traces, SLOs, failure handling, dead-letter queues.
• Evolution: relevance tuning, typo tolerance, synonym handling, multi-tenant isolation, and cost optimization.

Provide capacity estimates (daily uploads, QPS, data size) and justify design choices with trade-offs.
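As one illustration of the idempotency, retry, and dead-letter requirements listed above, here is a minimal in-memory sketch of a parse/resize worker. It is not a prescribed answer: the names `Job`, `Worker`, `idempotency_key`, and `max_attempts` are hypothetical, and a real deployment would presumably keep the completed-key set in a durable store and rely on the queue's own redelivery rather than an in-process loop.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Job:
    resume_id: str
    version: int
    task: str          # hypothetical task names: "parse" or "resize"
    attempts: int = 0

    @property
    def idempotency_key(self) -> str:
        # Same resume version + same task always yields the same key,
        # so a redelivered message can be recognised as a duplicate.
        raw = f"{self.resume_id}:{self.version}:{self.task}"
        return hashlib.sha256(raw.encode()).hexdigest()


@dataclass
class Worker:
    handler: Callable[[Job], None]
    max_attempts: int = 3
    done: set = field(default_factory=set)            # stand-in for a durable store
    dead_letters: list = field(default_factory=list)  # stand-in for a dead-letter queue

    def process(self, job: Job) -> str:
        # Skip work that already completed for this key (idempotent consume).
        if job.idempotency_key in self.done:
            return "skipped"
        # Retry up to max_attempts; any exception counts as a failed attempt.
        while job.attempts < self.max_attempts:
            job.attempts += 1
            try:
                self.handler(job)
            except Exception:
                continue
            self.done.add(job.idempotency_key)
            return "done"
        # Retries exhausted: park the job for inspection instead of looping forever.
        self.dead_letters.append(job)
        return "dead-lettered"
```

With a handler that fails once and then succeeds, `Worker(handler=...).process(Job("r1", 1, "parse"))` returns `"done"` after two attempts, and processing an identical job again returns `"skipped"`. Keying on `(resume_id, version, task)` is a design choice that ties idempotency to the versioning scheme: uploading a new resume version produces a new key and is reprocessed.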
Quick Answer: This question evaluates a candidate's competence in large-scale system architecture, data modeling, full-text search and indexing, asynchronous processing pipelines, storage and consistency trade-offs, API design, and security/privacy for handling uploaded documents.
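To make the full-text search and filtering portion concrete, the sketch below builds a search request body combining a text match with term and range filters plus cursor-style pagination. It assumes an Elasticsearch-style query DSL, which the question does not name; the function `build_search_body` and the field names `experience_text`, `skills`, `location`, `years_experience`, and `resume_id` are hypothetical mapping choices, not part of the original prompt.

```python
from typing import Any, Optional


def build_search_body(
    text: Optional[str] = None,
    skills: Optional[list[str]] = None,
    location: Optional[str] = None,
    min_years: Optional[int] = None,
    page_size: int = 20,
    search_after: Optional[list[Any]] = None,
) -> dict[str, Any]:
    must: list[dict[str, Any]] = []
    filters: list[dict[str, Any]] = []

    if text:
        # Scored full-text match on an analyzed field (assumed name).
        must.append({"match": {"experience_text": text}})
    if skills:
        # Non-scoring filter on a keyword field; matches ANY listed skill.
        filters.append({"terms": {"skills": skills}})
    if location:
        filters.append({"term": {"location": location}})
    if min_years is not None:
        filters.append({"range": {"years_experience": {"gte": min_years}}})

    body: dict[str, Any] = {
        "query": {"bool": {"must": must or [{"match_all": {}}], "filter": filters}},
        # A unique tiebreaker (resume_id) keeps the sort order stable across pages.
        "sort": [{"_score": "desc"}, {"uploaded_at": "desc"}, {"resume_id": "asc"}],
        "size": page_size,
    }
    if search_after:
        # Cursor pagination: pass the sort values of the last hit on the previous page.
        body["search_after"] = search_after
    return body
```

For example, `build_search_body(text="distributed systems", skills=["python", "go"], min_years=3)` yields one `match` clause under `must` and two entries under `filter`. Putting the structured criteria in `filter` rather than `must` is a deliberate choice in this sketch: filters do not affect relevance scoring and can be cached by the engine, while the free-text clause drives ranking.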