System Design: Static Audio Issue Detection Pipeline
Design a production-ready system that detects policy issues in user-uploaded, static audio files (not live streams). The system should automatically process newly saved files, flag problematic segments, and route outcomes to either automatic pass/fail or manual review.
Assume this runs in a multi-tenant UGC platform that ingests diverse audio formats and languages. Focus on scalability, reliability, and cost-aware operation.
Scope and Assumptions
- Input: Discrete audio files (e.g., mp3, wav, m4a), typical duration 1–30 minutes, multi-language.
- Out of scope: Realtime/live-stream moderation, image/video analysis, and client-side SDKs.
- Detection examples: keyword/profanity, hate/harassment, sexual content, self-harm, violent threats, piracy/music content, anomalies (e.g., TTS/voice cloning, synthetic manipulation).
- Human review exists for borderline cases; reviewers act on time-stamped evidence and transcripts.
Functional Requirements
- Ingestion and metadata:
  - Accept a saved audio file and register metadata (uploader, duration, checksum, mime type, language hint).
  - Trigger processing automatically for new files; support reprocessing (e.g., for new models/rules).
- Detection capabilities:
  - ASR transcription with timestamps; language detection.
  - Keyword/prohibited-content detection (dictionary + fuzzy/phonetic matches).
  - Classifiers for categories (hate, sexual, self-harm, violence), with confidence scores and spans.
  - Acoustic anomalies (e.g., synthetic voice, splicing artifacts, loudness spikes, silence/low-SNR).
  - Optional music/IP detection (fingerprinting).
- Decisions and workflow:
  - Outcome: PASS (auto-approve), FAIL (auto-reject/quarantine), or REVIEW (manual queue).
  - Explanations: evidence spans with timestamps, snippets, transcript excerpts, model confidence, triggered rules (an illustrative payload appears after this list).
  - Manual review task creation and labeling; feedback loops to improve rules/models.
- APIs and access:
  - Submit/list files, query processing status, retrieve detections/decisions.
  - Admin APIs for rules, thresholds, and model versions; reprocess by policy version.
- Operations:
  - Idempotent processing; safe retries; backfills; partial resumption for long files.
  - Auditing of decisions and reviewer actions.
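To make the explanations requirement concrete, here is an illustrative decision payload for a hypothetical status/detections endpoint. The field names mirror the logical schema in the next section; the values and the route are invented for illustration.

```python
# Illustrative response for a hypothetical GET /files/{id}/decision endpoint.
decision_example = {
    "file_id": "1f1e9c2a-...",             # elided UUID
    "outcome": "REVIEW",                    # PASS | FAIL | REVIEW
    "thresholds_version": "2024-06-01",
    "reasons": ["classifier:hate above review threshold"],
    "evidence": [
        {
            "detector_type": "classifier",
            "label": "hate",
            "confidence": 0.71,
            "start_ms": 62_000,
            "end_ms": 68_500,
            "transcript_snippet": "...",    # redacted excerpt shown to reviewers
        }
    ],
}
```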
Non-Functional Requirements
- Throughput: Start at 10k–100k files/day; scale to 1–5M files/day.
- Latency (batch/offline): P95 decision within 5–30 minutes of upload for most files; long files may take longer but under 4 hours.
- Availability: 99.9% for APIs; processing can run at 99% with queue backpressure.
- Durability: Original media at 11 nines (object storage); metadata replicated 3x.
- Cost: Optimize for $/audio-minute; support autoscaling and spot instances; keep the design cloud-vendor-agnostic.
- Compliance: Privacy and PII protection by design; data residency where required.
Key Entities and Data Schema (logical)
- AudioFile
  - id (UUID), uri (string), checksum (sha256), mime_type (string), duration_sec (int), sample_rate (int), channels (int), size_bytes (int)
  - uploader_id (string), tenant_id (string), language_hint (string|null), created_at (ts), status (enum: uploaded|processing|done|failed)
- ProcessingJob
  - id (UUID), file_id (UUID), pipeline_version (string), state (enum: queued|running|succeeded|failed|canceled)
  - started_at (ts), finished_at (ts|null), attempt (int), worker_id (string|null), error_code (string|null)
- DetectionEvent
  - id (UUID), file_id (UUID), job_id (UUID), detector_type (enum: asr|keyword|classifier|anomaly|ip)
  - rule_id (string|null), model_name (string), model_version (string), start_ms (int), end_ms (int), label (string), confidence (float)
  - details_json (jsonb), e.g., transcript snippet, tokens, features
- Decision
  - id (UUID), file_id (UUID), job_id (UUID), outcome (enum: PASS|FAIL|REVIEW)
  - reasons (array<string>), evidence (array<detection_id>), thresholds_version (string), decided_at (ts)
- ReviewTask
  - id (UUID), file_id (UUID), created_at (ts), assigned_to (string|null), state (enum: open|in_progress|resolved)
  - sla_deadline (ts), priority (int), auto_decision (enum|null)
- Label (human annotation)
  - id (UUID), review_task_id (UUID), file_id (UUID), start_ms (int), end_ms (int), label (string), notes (string), reviewer_id (string), created_at (ts)
- AuditLog
  - id (UUID), actor (system|user_id), action (string), entity_type (string), entity_id (UUID), timestamp (ts), payload_json (jsonb)
- Rule
  - id (string), version (string), type (keyword|threshold|heuristic), config_json (jsonb), active (bool), created_at (ts)
- ModelRegistry (metadata only)
  - model_name (string), version (string), checksum (sha256), uri (string), created_at (ts), validated (bool)
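A minimal Python sketch of two core entities, assuming the field types above; in production these would live behind an ORM or SQL DDL rather than plain dataclasses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID

class Outcome(str, Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    REVIEW = "REVIEW"

@dataclass
class DetectionEvent:
    id: UUID
    file_id: UUID
    job_id: UUID
    detector_type: str                 # asr | keyword | classifier | anomaly | ip
    model_name: str
    model_version: str
    start_ms: int
    end_ms: int
    label: str
    confidence: float
    rule_id: Optional[str] = None
    details_json: dict = field(default_factory=dict)   # snippet, tokens, features

@dataclass
class Decision:
    id: UUID
    file_id: UUID
    job_id: UUID
    outcome: Outcome
    reasons: list[str]
    evidence: list[UUID]               # DetectionEvent ids backing the outcome
    thresholds_version: str
    decided_at: datetime
```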
High-Level Architecture
- Storage
  - Object storage for raw audio and derived artifacts (normalized audio, segments, transcripts).
  - Relational DB for metadata and decisions (e.g., Postgres), plus an analytical store (warehouse/lake) for offline metrics.
- Ingestion Service
  - Validates the upload, computes the checksum, writes the AudioFile record, and posts an event (NewFileSaved).
- Event Bus / Queue
  - Event-driven trigger (e.g., NewFileSaved) into a work queue with backpressure and a DLQ (a minimal event sketch follows this list).
- Orchestrator / Scheduler
  - Converts events into ProcessingJobs; partitions by tenant; enforces rate limits and priorities.
- Processing Workers (containerized)
  - Preprocessing: decode, resample, VAD segmentation, language ID, silence detection.
  - ASR engine: chunked streaming inference for timestamps; multi-language.
  - Keyword & rule engine: dictionary/fuzzy, phonetic, and regex matching on transcripts with timestamps.
  - Content classifiers: multi-label toxic/sexual/self-harm/violence; time-localized.
  - Acoustic anomaly detectors: voice-cloning/TTS likelihood, tampering, loudness.
  - Optional music/IP fingerprinting service.
  - Writes DetectionEvents and artifacts; computes the Decision.
- Result Store and API
  - Upserts Decisions; exposes GET status, GET detections, POST reprocess.
- Review Service + UI
  - Task queues, evidence playback (waveform + highlighted spans), reviewer analytics.
- Observability
  - Metrics (throughput, latency, accuracy), logs, traces, dashboards, alerts.
- Model Ops
  - Model registry, canary rollout, shadow testing, A/B tests, threshold management.
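A minimal sketch of the NewFileSaved event and its publication, assuming a transport-agnostic `send` callable; the actual bus (SQS, Kafka, Pub/Sub) is a deployment choice, and this shape is illustrative rather than a fixed contract.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class NewFileSaved:
    file_id: str
    checksum: str            # sha256 of the stored object
    pipeline_version: str
    tenant_id: str

def publish(event: NewFileSaved, send) -> None:
    # `send` is a transport-specific callable (SQS, Kafka, Pub/Sub, ...).
    # Keying by tenant_id groups a tenant's files so the orchestrator can
    # apply per-tenant partitioning, priorities, and rate limits.
    send(key=event.tenant_id, value=json.dumps(asdict(event)))
```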
End-to-End Flow for Newly Saved Files
1. Upload and register
   - Client uploads to object storage via a signed URL; the Ingestion Service records the AudioFile, computes the checksum, and emits NewFileSaved(file_id, checksum, pipeline_version).
2. Job creation
   - Orchestrator consumes the event, checks idempotency (does a successful job already exist for file_id + pipeline_version?), creates a ProcessingJob, and enqueues work with a priority/tenant partition key.
3. Processing
   - Worker pulls the job, locks it, and streams audio from storage.
   - Normalize -> VAD segmentation -> language ID -> ASR per segment with timestamps.
   - Run the keyword/rule engine on the transcript; run classifiers on text and/or audio embeddings; run anomaly detectors.
   - Persist DetectionEvents incrementally; store transcript/segments in the artifact bucket.
4. Decisioning
   - Apply rules/thresholds to produce PASS/FAIL/REVIEW; write the Decision.
   - If REVIEW, create a ReviewTask with evidence spans.
5. Completion and callbacks
   - Update the ProcessingJob to succeeded; publish a FileProcessed event; optionally call a webhook to the product backend.
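A condensed worker-stage sketch of steps 3–4 above. Every helper here (`normalize`, `vad_segments`, `detect_language`, `transcribe`, `run_detectors`, `decide`) and the `storage`/`db` handles are hypothetical stand-ins for the real components, not an actual API.

```python
def process_job(job, storage, db):
    """One pass over a ProcessingJob; all helpers are hypothetical stand-ins."""
    audio = normalize(storage.stream(job.file_uri))       # decode + resample
    all_events = []
    for seg in vad_segments(audio):                       # speech-only chunks
        lang = detect_language(seg)
        transcript = transcribe(seg, lang)                # ASR with timestamps
        seg_events = run_detectors(seg, transcript, lang) # keyword/classifier/anomaly
        db.upsert_detection_events(job.id, seg_events)    # incremental, idempotent
        all_events.extend(seg_events)
    decision = decide(all_events)                         # apply rules/thresholds
    db.write_decision(job.id, decision)
    if decision.outcome == "REVIEW":
        db.create_review_task(job.file_id, evidence=decision.evidence)
    db.mark_job_succeeded(job.id)
```

Persisting DetectionEvents per segment, rather than at the end, is what enables the partial resumption and per-stage retries described in the next section.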
Idempotency, Retries, Failure Handling
- Idempotency keys
  - job_key = hash(file_id, pipeline_version). Before starting, check whether a succeeded job already exists; if so, short-circuit (see the sketch at the end of this section).
  - Deduplicate uploads by checksum; maintain an AudioFile(checksum) -> latest-file map.
- Exactly-once effects (practical)
  - UPSERT ProcessingJob and DetectionEvent rows keyed by (job_id, detector_type, start_ms, end_ms) to avoid duplicates.
  - Write artifacts to deterministic paths (file_id/pipeline_version/...).
- Retries
  - Transient failures: exponential backoff with jitter; bounded max_attempts (e.g., 5). Retry per stage with checkpoints (segment offsets) to avoid redoing the whole file.
  - Permanent failures: mark the job failed with an error_code; send to the DLQ; alert if the failure rate exceeds a threshold.
- Timeouts and cancellation
  - Enforce a per-segment timeout and a global max processing time per file based on its duration; support cancellation when a new version of the upload arrives.
- Backfills/reprocessing
  - Re-enqueue by policy/model version; maintain lineage linking each Decision to the rule/model versions that produced it.
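A minimal sketch of the idempotency key and retry-with-jitter patterns above; `TransientError` is a hypothetical exception class that stages would raise for retryable faults.

```python
import hashlib
import random
import time

class TransientError(Exception):
    """Raised by stages for retryable faults (timeouts, throttling)."""

def job_key(file_id: str, pipeline_version: str) -> str:
    # Deterministic key: retries and duplicate events map to the same job,
    # so a succeeded job can short-circuit reprocessing.
    return hashlib.sha256(f"{file_id}:{pipeline_version}".encode()).hexdigest()

def run_with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Exponential backoff with full jitter. `fn` must be an idempotent
    stage (safe to call again after a partial run)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise                  # becomes a permanent failure -> DLQ
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```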
Privacy, Security, and PII
- Minimize data: store transcripts and short evidence snippets; avoid duplicating full audio unless needed.
- Encryption: in transit (TLS) and at rest (KMS-managed keys); per-tenant keys if required.
- Access controls: RBAC/ABAC for engineers and reviewers; scoped tokens for services; audit every access.
- Data residency: tag data with its region; route processing to in-region compute; restrict cross-region storage.
- Retention: configurable TTL for transcripts and detections; immediate deletion on user request; right-to-erasure workflow.
- Safe evidence: redact PII in transcripts before display; mask sensitive audio in the reviewer UI.
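For flavor, a deliberately minimal redaction pass; production redaction would use a dedicated PII/NER model and locale-aware rules, and these regex patterns are illustrative only.

```python
import re

# Illustrative patterns only: obvious email addresses and phone-like digit runs.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(transcript: str) -> str:
    transcript = EMAIL.sub("[EMAIL]", transcript)
    return PHONE.sub("[PHONE]", transcript)

assert redact("call me at +1 415-555-0100") == "call me at [PHONE]"
```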
Scaling Plan: Thousands to Millions of Files/Day
- Workload characterization
  - Throughput target: 1M files/day ≈ 11.6 files/sec on average; bursts of 10–20x.
  - If the average duration is 3 minutes, that is 3M audio-minutes/day. Assuming ASR at 1.5x real-time per vCPU or 5x real-time per GPU:
    - CPU: ≈ 3M / 1.5 = 2M vCPU-minutes/day ≈ 1,400 vCPUs steady (plus overhead).
    - GPU: ≈ 3M / 5 = 600k GPU-minutes/day = 10,000 GPU-hours/day, i.e., roughly 420 GPUs (e.g., A10s) running continuously. Actual numbers depend heavily on the model and batching (see the worked calculation after this list).
- Horizontal scaling
  - Stateless workers; autoscale on queue depth and segment backlog; shard by tenant and file_id for balance.
  - Segment long files into N-second chunks (e.g., 20–30 s) to improve parallelism and failure isolation.
- Storage and DB
  - Object storage scales elastically; enable multi-part reads.
  - Partition relational tables by created_at and tenant_id; index (file_id), (job_id), and (detector_type, file_id, start_ms).
  - Stream DetectionEvents to the warehouse (e.g., via CDC or events) for analytics without overloading the OLTP DB.
- Cost controls
  - Right-size models: run a small/fast model as a first pass; escalate to heavy models only for likely violations or high-risk tenants.
  - Use spot/preemptible instances for batch work; checkpoint to tolerate preemption.
  - Cache transcripts for identical checksums; reuse results across re-uploads.
- Reliability and SLOs
  - Multi-AZ workers; queue with at-least-once delivery; DLQ monitoring.
  - SLO dashboards: time-to-decision, backlog, failure rate, reviewer SLA.
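The capacity arithmetic from the workload characterization above, worked through in Python (the 1.5x/5x real-time speedups are the stated assumptions, not benchmarks):

```python
# Worked capacity math for the 1M files/day scenario.
files_per_day = 1_000_000
avg_minutes = 3
audio_min_per_day = files_per_day * avg_minutes      # 3,000,000 audio-min/day

cpu_speedup, gpu_speedup = 1.5, 5.0                  # x real-time (assumed)
cpu_min = audio_min_per_day / cpu_speedup            # 2,000,000 vCPU-min/day
vcpus_steady = cpu_min / (24 * 60)                   # ≈ 1,389 vCPUs

gpu_min = audio_min_per_day / gpu_speedup            # 600,000 GPU-min/day
gpu_hours = gpu_min / 60                             # 10,000 GPU-hours/day
gpus_steady = gpu_hours / 24                         # ≈ 417 GPUs continuously

print(f"{vcpus_steady:.0f} vCPUs or {gpus_steady:.0f} GPUs steady-state")
```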
Cron-Based Batch vs Event/Queue-Driven Processing
- Cron-based batch
  - Pros: simple to operate; good for predictable reprocessing/backfills; easy cost batching (off-peak runs).
  - Cons: higher decision latency; handles bursty traffic poorly; less natural backpressure; harder per-file prioritization.
  - Use cases: nightly reprocessing for new rules/models; large backfills; cost-optimized windows.
- Event/queue-driven
  - Pros: low latency; natural backpressure; fine-grained prioritization (tenant, content type); better autoscaling.
  - Cons: more moving parts (idempotency, ordering, DLQ); requires robust observability.
  - Use cases: primary path for new uploads; interactive products that need quick moderation.
- Hybrid strategy
  - Event-driven for new files; cron/batch for backfills and model/rule updates. Both share the same worker code; only the scheduler differs (see the sketch below).
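A sketch of the hybrid strategy's shared-worker idea; `create_job`, `enqueue`, and the `db` query are hypothetical names standing in for the orchestrator pieces described earlier.

```python
# Both entry points funnel into the same worker code; only the trigger differs.

def on_new_file_saved(event):                   # event-driven path (primary)
    job = create_job(event["file_id"], pipeline_version="current")
    enqueue(job)                                # low-latency queue with backpressure

def nightly_backfill(db, target_version):       # cron path (reprocessing)
    for file_id in db.files_missing_version(target_version):
        enqueue(create_job(file_id, pipeline_version=target_version))
```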
Additional Considerations and Pitfalls
- Multi-language and code-switching: run language ID per segment; use language-specific keyword lists and models.
- Adversarial audio: obfuscation (beeps, pitch shifts); counter with phonetic/fuzzy matching and acoustic cues (a fuzzy-matching sketch follows this list).
- Long silences or music: use VAD to skip silence; route music/IP detections to a different policy.
- Threshold tuning: calibrate per label; use precision-recall curves; set different thresholds for auto-fail vs. review.
- Feedback loop: reviewer labels feed model retraining; track model drift and data quality.
- Fairness: evaluate confusion rates across languages/accents; document mitigations.
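A minimal fuzzy-matching sketch using only the standard library; the blocklist terms are placeholders, and a production keyword engine would add phonetic encodings (Soundex/Metaphone) and language-specific lists as noted above.

```python
import difflib

BLOCKLIST = {"badword", "forbidden"}            # illustrative placeholder terms

def fuzzy_hits(tokens: list[str], cutoff: float = 0.85) -> list[tuple[str, str]]:
    """Flag tokens that closely resemble blocklisted terms, catching simple
    obfuscations such as character swaps."""
    hits = []
    for tok in tokens:
        match = difflib.get_close_matches(tok.lower(), BLOCKLIST, n=1, cutoff=cutoff)
        if match:
            hits.append((tok, match[0]))
    return hits

print(fuzzy_hits(["hello", "badw0rd"]))         # [('badw0rd', 'badword')]
```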
Small Numerical Example (Throughput and Cost)
- Scenario: 100k files/day, averaging 2 minutes each -> 200k audio-minutes/day.
- ASR at 4x real-time on GPU -> 200k / 4 = 50k GPU-minutes/day ≈ 833 GPU-hours/day.
- At $1.50 per GPU-hour, ASR costs ≈ $1,250/day. Adding classifiers/anomaly detectors on CPU at ~0.2x the ASR cost brings the total to ≈ $1,500/day.
- Unit cost ≈ $1,500 / 100k = $0.015 per file (excluding storage and overheads). Real-world costs vary by model and cloud pricing.
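The same arithmetic as a runnable check (the 4x speedup and $1.50/GPU-hour price are the example's assumptions):

```python
audio_min = 100_000 * 2                 # 200,000 audio-minutes/day
gpu_min = audio_min / 4                 # 50,000 GPU-min/day at 4x real-time
gpu_hours = gpu_min / 60                # ≈ 833 GPU-hours/day
asr_cost = gpu_hours * 1.50             # ≈ $1,250/day
total = asr_cost * 1.2                  # +0.2x for CPU classifiers/anomalies
print(f"${total:,.0f}/day, ${total / 100_000:.4f}/file")   # ≈ $1,500/day, $0.0150/file
```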
Validation and Guardrails
- Shadow-deploy new models/rules; compare outcomes on 5–10% of traffic; keep a held-out labeled set for offline metrics.
- Monitor precision/recall per category, reviewer disagreement rate, average time-to-decision, and false-positive appeals.
- Implement circuit breakers: if a detector's error or flag rate spikes, auto-disable it or raise its thresholds; alert on anomalies (a minimal breaker sketch follows this list).
- Maintain golden datasets and synthetic test cases for regression; run per-language canaries.
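A minimal circuit-breaker sketch for the guardrail above; the window size, error-rate threshold, and cooldown are illustrative defaults, not recommendations.

```python
import time

class DetectorBreaker:
    """Trips when a detector's recent error/flag rate exceeds a threshold,
    then rejects calls until a cooldown passes. Values are illustrative."""

    def __init__(self, max_error_rate=0.2, window=100, cooldown_s=300):
        self.max_error_rate = max_error_rate
        self.window = window
        self.cooldown_s = cooldown_s
        self.results = []                       # rolling window: True = error
        self.tripped_at = None

    def allow(self) -> bool:
        if self.tripped_at and time.time() - self.tripped_at < self.cooldown_s:
            return False                        # open: skip detector, alert upstream
        self.tripped_at = None                  # cooldown elapsed: half-open again
        return True

    def record(self, error: bool) -> None:
        self.results = (self.results + [error])[-self.window:]
        if len(self.results) == self.window and \
                sum(self.results) / self.window > self.max_error_rate:
            self.tripped_at = time.time()       # trip and start cooldown
```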