System Design: Static Audio Issue Detection Pipeline
Design a production-ready system that detects policy issues in user-uploaded, static audio files (not live streams). The system should automatically process newly saved files, flag problematic segments, and route outcomes to either automatic pass/fail or manual review.
Assume this runs in a multi-tenant UGC platform that ingests diverse audio formats and languages. Focus on scalability, reliability, and cost-aware operation.
Scope and Assumptions
- Input: Discrete audio files (e.g., mp3, wav, m4a), typical duration 1–30 minutes, multi-language.
- Out of scope: Realtime/live-stream moderation, image/video analysis, and client-side SDKs.
- Detection examples: keyword/profanity, hate/harassment, sexual content, self-harm, violent threats, piracy/music content, anomalies (e.g., TTS/voice cloning, synthetic manipulation).
- Human review exists for borderline cases; reviewers act on time-stamped evidence and transcripts.
Functional Requirements
- Ingestion and metadata:
  - Accept a saved audio file and register metadata (uploader, duration, checksum, mime type, language hint).
  - Trigger processing automatically for new files; support reprocessing (e.g., for new models/rules).
- Detection capabilities:
  - ASR transcription with timestamps; language detection.
  - Keyword/prohibited-content detection (dictionary + fuzzy/phonetic matches).
  - Classifiers for categories (hate, sexual, self-harm, violence), with confidence scores and spans.
  - Acoustic anomalies (e.g., synthetic voice, splicing artifacts, loudness spikes, silence/low-SNR).
  - Optional music/IP detection (fingerprinting).
- Decisions and workflow:
  - Outcome: PASS (auto-approve), FAIL (auto-reject/quarantine), or REVIEW (manual queue).
  - Explanations: evidence spans with timestamps, snippets, transcript excerpts, model confidence, triggered rules (an illustrative payload appears after this list).
  - Manual review task creation and labeling; feedback loops to improve rules/models.
- APIs and access:
  - Submit/list files, query processing status, retrieve detections/decisions.
  - Admin APIs for rules, thresholds, and model versions; reprocess by policy version.
- Operations:
  - Idempotent processing; safe retries; backfills; partial resumption for long files.
  - Auditing of decisions and reviewer actions.
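To make the explanations requirement concrete, here is an illustrative decision payload for a hypothetical status/detections endpoint. The field names mirror the logical schema in the next section; the values and the route are invented for illustration.

```python
# Illustrative response for a hypothetical GET /files/{id}/decision endpoint.
decision_example = {
    "file_id": "1f1e9c2a-...",             # elided UUID
    "outcome": "REVIEW",                    # PASS | FAIL | REVIEW
    "thresholds_version": "2024-06-01",
    "reasons": ["classifier:hate above review threshold"],
    "evidence": [
        {
            "detector_type": "classifier",
            "label": "hate",
            "confidence": 0.71,
            "start_ms": 62_000,
            "end_ms": 68_500,
            "transcript_snippet": "...",    # redacted excerpt shown to reviewers
        }
    ],
}
```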
Non-Functional Requirements
- Throughput: Start at 10k–100k files/day; scale to 1–5M files/day.
- Latency (batch/offline): P95 decision within 5–30 minutes of upload for most files; long files may take longer but under 4 hours.
- Availability: 99.9% for APIs; processing can run at 99% with queue backpressure.
- Durability: Original media at 11 nines (object storage); metadata replicated 3x.
- Cost: Optimize for $/audio-minute; support autoscaling and spot instances; keep the design cloud-vendor-agnostic.
- Compliance: Privacy and PII protection by design; data residency where required.
Key Entities and Data Schema (logical)
- AudioFile
  - id (UUID), uri (string), checksum (sha256), mime_type (string), duration_sec (int), sample_rate (int), channels (int), size_bytes (int)
  - uploader_id (string), tenant_id (string), language_hint (string|null), created_at (ts), status (enum: uploaded|processing|done|failed)
- ProcessingJob
  - id (UUID), file_id (UUID), pipeline_version (string), state (enum: queued|running|succeeded|failed|canceled)
  - started_at (ts), finished_at (ts|null), attempt (int), worker_id (string|null), error_code (string|null)
- DetectionEvent
  - id (UUID), file_id (UUID), job_id (UUID), detector_type (enum: asr|keyword|classifier|anomaly|ip)
  - rule_id (string|null), model_name (string), model_version (string), start_ms (int), end_ms (int), label (string), confidence (float)
  - details_json (jsonb), e.g., transcript snippet, tokens, features
- Decision
  - id (UUID), file_id (UUID), job_id (UUID), outcome (enum: PASS|FAIL|REVIEW)
  - reasons (array<string>), evidence (array<detection_id>), thresholds_version (string), decided_at (ts)
- ReviewTask
  - id (UUID), file_id (UUID), created_at (ts), assigned_to (string|null), state (enum: open|in_progress|resolved)
  - sla_deadline (ts), priority (int), auto_decision (enum|null)
- Label (human annotation)
  - id (UUID), review_task_id (UUID), file_id (UUID), start_ms (int), end_ms (int), label (string), notes (string), reviewer_id (string), created_at (ts)
- AuditLog
  - id (UUID), actor (system|user_id), action (string), entity_type (string), entity_id (UUID), timestamp (ts), payload_json (jsonb)
- Rule
  - id (string), version (string), type (keyword|threshold|heuristic), config_json (jsonb), active (bool), created_at (ts)
- ModelRegistry (metadata only)
  - model_name (string), version (string), checksum (sha256), uri (string), created_at (ts), validated (bool)
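A minimal Python sketch of two core entities, assuming the field types above; in production these would live behind an ORM or SQL DDL rather than plain dataclasses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID

class Outcome(str, Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    REVIEW = "REVIEW"

@dataclass
class DetectionEvent:
    id: UUID
    file_id: UUID
    job_id: UUID
    detector_type: str                 # asr | keyword | classifier | anomaly | ip
    model_name: str
    model_version: str
    start_ms: int
    end_ms: int
    label: str
    confidence: float
    rule_id: Optional[str] = None
    details_json: dict = field(default_factory=dict)   # snippet, tokens, features

@dataclass
class Decision:
    id: UUID
    file_id: UUID
    job_id: UUID
    outcome: Outcome
    reasons: list[str]
    evidence: list[UUID]               # DetectionEvent ids backing the outcome
    thresholds_version: str
    decided_at: datetime
```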
High-Level Architecture
- Storage
  - Object storage for raw audio and derived artifacts (normalized audio, segments, transcripts).
  - Relational DB for metadata and decisions (e.g., Postgres), plus an analytical store (warehouse/lake) for offline metrics.
- Ingestion Service
  - Validates the upload, computes the checksum, writes the AudioFile record, and posts an event (NewFileSaved).
- Event Bus / Queue
  - Event-driven trigger (e.g., NewFileSaved) into a work queue with backpressure and a DLQ (a minimal event sketch follows this list).
- Orchestrator / Scheduler
  - Converts events into ProcessingJobs; partitions by tenant; enforces rate limits and priorities.
- Processing Workers (containerized)
  - Preprocessing: decode, resample, VAD segmentation, language ID, silence detection.
  - ASR engine: chunked streaming inference for timestamps; multi-language.
  - Keyword & rule engine: dictionary/fuzzy, phonetic, and regex matching on transcripts with timestamps.
  - Content classifiers: multi-label toxic/sexual/self-harm/violence; time-localized.
  - Acoustic anomaly detectors: voice-cloning/TTS likelihood, tampering, loudness.
  - Optional music/IP fingerprinting service.
  - Writes DetectionEvents and artifacts; computes the Decision.
- Result Store and API
  - Upserts Decisions; exposes GET status, GET detections, POST reprocess.
- Review Service + UI
  - Task queues, evidence playback (waveform + highlighted spans), reviewer analytics.
- Observability
  - Metrics (throughput, latency, accuracy), logs, traces, dashboards, alerts.
- Model Ops
  - Model registry, canary rollout, shadow testing, A/B tests, threshold management.
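A minimal sketch of the NewFileSaved event and its publication, assuming a transport-agnostic `send` callable; the actual bus (SQS, Kafka, Pub/Sub) is a deployment choice, and this shape is illustrative rather than a fixed contract.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class NewFileSaved:
    file_id: str
    checksum: str            # sha256 of the stored object
    pipeline_version: str
    tenant_id: str

def publish(event: NewFileSaved, send) -> None:
    # `send` is a transport-specific callable (SQS, Kafka, Pub/Sub, ...).
    # Keying by tenant_id groups a tenant's files so the orchestrator can
    # apply per-tenant partitioning, priorities, and rate limits.
    send(key=event.tenant_id, value=json.dumps(asdict(event)))
```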
End-to-End Flow for Newly Saved Files
1. Upload and register
   - Client uploads to object storage via a signed URL; the Ingestion Service records the AudioFile, computes the checksum, and emits NewFileSaved(file_id, checksum, pipeline_version).
2. Job creation
   - Orchestrator consumes the event, checks idempotency (does a successful job already exist for file_id + pipeline_version?), creates a ProcessingJob, and enqueues work with a priority/tenant partition key.
3. Processing
   - Worker pulls the job, locks it, and streams audio from storage.
   - Normalize -> VAD segmentation -> language ID -> ASR per segment with timestamps.
   - Run the keyword/rule engine on the transcript; run classifiers on text and/or audio embeddings; run anomaly detectors.
   - Persist DetectionEvents incrementally; store transcript/segments in the artifact bucket.
4. Decisioning
   - Apply rules/thresholds to produce PASS/FAIL/REVIEW; write the Decision.
   - If REVIEW, create a ReviewTask with evidence spans.
5. Completion and callbacks
   - Update the ProcessingJob to succeeded; publish a FileProcessed event; optionally call a webhook to the product backend.
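A condensed worker-stage sketch of steps 3–4 above. Every helper here (`normalize`, `vad_segments`, `detect_language`, `transcribe`, `run_detectors`, `decide`) and the `storage`/`db` handles are hypothetical stand-ins for the real components, not an actual API.

```python
def process_job(job, storage, db):
    """One pass over a ProcessingJob; all helpers are hypothetical stand-ins."""
    audio = normalize(storage.stream(job.file_uri))       # decode + resample
    all_events = []
    for seg in vad_segments(audio):                       # speech-only chunks
        lang = detect_language(seg)
        transcript = transcribe(seg, lang)                # ASR with timestamps
        seg_events = run_detectors(seg, transcript, lang) # keyword/classifier/anomaly
        db.upsert_detection_events(job.id, seg_events)    # incremental, idempotent
        all_events.extend(seg_events)
    decision = decide(all_events)                         # apply rules/thresholds
    db.write_decision(job.id, decision)
    if decision.outcome == "REVIEW":
        db.create_review_task(job.file_id, evidence=decision.evidence)
    db.mark_job_succeeded(job.id)
```

Persisting DetectionEvents per segment, rather than at the end, is what enables the partial resumption and per-stage retries described in the next section.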
Idempotency, Retries, Failure Handling
- Idempotency keys
  - job_key = hash(file_id, pipeline_version). Before starting, check whether a succeeded job already exists; if so, short-circuit (see the sketch at the end of this section).
  - Deduplicate uploads by checksum; maintain an AudioFile(checksum) -> latest-file map.
- Exactly-once effects (practical)
  - UPSERT ProcessingJob and DetectionEvent rows keyed by (job_id, detector_type, start_ms, end_ms) to avoid duplicates.
  - Write artifacts to deterministic paths (file_id/pipeline_version/...).
- Retries
  - Transient failures: exponential backoff with jitter; bounded max_attempts (e.g., 5). Retry per stage with checkpoints (segment offsets) to avoid redoing the whole file.
  - Permanent failures: mark the job failed with an error_code; send to the DLQ; alert if the failure rate exceeds a threshold.
- Timeouts and cancellation
  - Enforce a per-segment timeout and a global max processing time per file based on its duration; support cancellation when a new version of the upload arrives.
- Backfills/reprocessing
  - Re-enqueue by policy/model version; maintain lineage linking each Decision to the rule/model versions that produced it.
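A minimal sketch of the idempotency key and retry-with-jitter patterns above; `TransientError` is a hypothetical exception class that stages would raise for retryable faults.

```python
import hashlib
import random
import time

class TransientError(Exception):
    """Raised by stages for retryable faults (timeouts, throttling)."""

def job_key(file_id: str, pipeline_version: str) -> str:
    # Deterministic key: retries and duplicate events map to the same job,
    # so a succeeded job can short-circuit reprocessing.
    return hashlib.sha256(f"{file_id}:{pipeline_version}".encode()).hexdigest()

def run_with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Exponential backoff with full jitter. `fn` must be an idempotent
    stage (safe to call again after a partial run)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise                  # becomes a permanent failure -> DLQ
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```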
Privacy, Security, and PII
- Minimize data: store transcripts and short evidence snippets; avoid duplicating full audio unless needed.
- Encryption: in transit (TLS) and at rest (KMS-managed keys); per-tenant keys if required.
- Access controls: RBAC/ABAC for engineers and reviewers; scoped tokens for services; audit every access.
- Data residency: tag data with its region; route processing to in-region compute; restrict cross-region storage.
- Retention: configurable TTL for transcripts and detections; immediate deletion on user request; right-to-erasure workflow.
- Safe evidence: redact PII in transcripts before display; mask sensitive audio in the reviewer UI.
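For flavor, a deliberately minimal redaction pass; production redaction would use a dedicated PII/NER model and locale-aware rules, and these regex patterns are illustrative only.

```python
import re

# Illustrative patterns only: obvious email addresses and phone-like digit runs.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(transcript: str) -> str:
    transcript = EMAIL.sub("[EMAIL]", transcript)
    return PHONE.sub("[PHONE]", transcript)

assert redact("call me at +1 415-555-0100") == "call me at [PHONE]"
```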
Scaling Plan: Thousands to Millions of Files/Day
- Workload characterization
  - Throughput target: 1M files/day ≈ 11.6 files/sec on average; bursts of 10–20x.
  - If the average duration is 3 minutes, that is 3M audio-minutes/day. Assuming ASR at 1.5x real-time per vCPU or 5x real-time per GPU:
    - CPU: ≈ 3M / 1.5 = 2M vCPU-minutes/day ≈ 1,400 vCPUs steady (plus overhead).
    - GPU: ≈ 3M / 5 = 600k GPU-minutes/day = 10,000 GPU-hours/day, i.e., roughly 420 GPUs (e.g., A10s) running continuously. Actual numbers depend heavily on the model and batching (see the worked calculation after this list).
- Horizontal scaling
  - Stateless workers; autoscale on queue depth and segment backlog; shard by tenant and file_id for balance.
  - Segment long files into N-second chunks (e.g., 20–30 s) to improve parallelism and failure isolation.
- Storage and DB
  - Object storage scales elastically; enable multi-part reads.
  - Partition relational tables by created_at and tenant_id; index (file_id), (job_id), and (detector_type, file_id, start_ms).
  - Stream DetectionEvents to the warehouse (e.g., via CDC or events) for analytics without overloading the OLTP DB.
- Cost controls
  - Right-size models: run a small/fast model as a first pass; escalate to heavy models only for likely violations or high-risk tenants.
  - Use spot/preemptible instances for batch work; checkpoint to tolerate preemption.
  - Cache transcripts for identical checksums; reuse results across re-uploads.
- Reliability and SLOs
  - Multi-AZ workers; queue with at-least-once delivery; DLQ monitoring.
  - SLO dashboards: time-to-decision, backlog, failure rate, reviewer SLA.
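The capacity arithmetic from the workload characterization above, worked through in Python (the 1.5x/5x real-time speedups are the stated assumptions, not benchmarks):

```python
# Worked capacity math for the 1M files/day scenario.
files_per_day = 1_000_000
avg_minutes = 3
audio_min_per_day = files_per_day * avg_minutes      # 3,000,000 audio-min/day

cpu_speedup, gpu_speedup = 1.5, 5.0                  # x real-time (assumed)
cpu_min = audio_min_per_day / cpu_speedup            # 2,000,000 vCPU-min/day
vcpus_steady = cpu_min / (24 * 60)                   # ≈ 1,389 vCPUs

gpu_min = audio_min_per_day / gpu_speedup            # 600,000 GPU-min/day
gpu_hours = gpu_min / 60                             # 10,000 GPU-hours/day
gpus_steady = gpu_hours / 24                         # ≈ 417 GPUs continuously

print(f"{vcpus_steady:.0f} vCPUs or {gpus_steady:.0f} GPUs steady-state")
```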
Cron-Based Batch vs Event/Queue-Driven Processing
- Cron-based batch
  - Pros: simple to operate; good for predictable reprocessing/backfills; easy cost batching (off-peak runs).
  - Cons: higher decision latency; handles bursty traffic poorly; less natural backpressure; harder per-file prioritization.
  - Use cases: nightly reprocessing for new rules/models; large backfills; cost-optimized windows.
- Event/queue-driven
  - Pros: low latency; natural backpressure; fine-grained prioritization (tenant, content type); better autoscaling.
  - Cons: more moving parts (idempotency, ordering, DLQ); requires robust observability.
  - Use cases: primary path for new uploads; interactive products that need quick moderation.
- Hybrid strategy
  - Event-driven for new files; cron/batch for backfills and model/rule updates. Both share the same worker code; only the scheduler differs (see the sketch below).
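A sketch of the hybrid strategy's shared-worker idea; `create_job`, `enqueue`, and the `db` query are hypothetical names standing in for the orchestrator pieces described earlier.

```python
# Both entry points funnel into the same worker code; only the trigger differs.

def on_new_file_saved(event):                   # event-driven path (primary)
    job = create_job(event["file_id"], pipeline_version="current")
    enqueue(job)                                # low-latency queue with backpressure

def nightly_backfill(db, target_version):       # cron path (reprocessing)
    for file_id in db.files_missing_version(target_version):
        enqueue(create_job(file_id, pipeline_version=target_version))
```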
Additional Considerations and Pitfalls
- Multi-language and code-switching: run language ID per segment; use language-specific keyword lists and models.
- Adversarial audio: obfuscation (beeps, pitch shifts); counter with phonetic/fuzzy matching and acoustic cues (a fuzzy-matching sketch follows this list).
- Long silences or music: use VAD to skip silence; route music/IP detections to a different policy.
- Threshold tuning: calibrate per label; use precision-recall curves; set different thresholds for auto-fail vs. review.
- Feedback loop: reviewer labels feed model retraining; track model drift and data quality.
- Fairness: evaluate confusion rates across languages/accents; document mitigations.
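A minimal fuzzy-matching sketch using only the standard library; the blocklist terms are placeholders, and a production keyword engine would add phonetic encodings (Soundex/Metaphone) and language-specific lists as noted above.

```python
import difflib

BLOCKLIST = {"badword", "forbidden"}            # illustrative placeholder terms

def fuzzy_hits(tokens: list[str], cutoff: float = 0.85) -> list[tuple[str, str]]:
    """Flag tokens that closely resemble blocklisted terms, catching simple
    obfuscations such as character swaps."""
    hits = []
    for tok in tokens:
        match = difflib.get_close_matches(tok.lower(), BLOCKLIST, n=1, cutoff=cutoff)
        if match:
            hits.append((tok, match[0]))
    return hits

print(fuzzy_hits(["hello", "badw0rd"]))         # [('badw0rd', 'badword')]
```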
Small Numerical Example (Throughput and Cost)
- Scenario: 100k files/day, averaging 2 minutes each -> 200k audio-minutes/day.
- ASR at 4x real-time on GPU -> 200k / 4 = 50k GPU-minutes/day ≈ 833 GPU-hours/day.
- At $1.50 per GPU-hour, ASR costs ≈ $1,250/day. Adding classifiers/anomaly detectors on CPU at ~0.2x the ASR cost brings the total to ≈ $1,500/day.
- Unit cost ≈ $1,500 / 100k = $0.015 per file (excluding storage and overheads). Real-world costs vary by model and cloud pricing.
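The same arithmetic as a runnable check (the 4x speedup and $1.50/GPU-hour price are the example's assumptions):

```python
audio_min = 100_000 * 2                 # 200,000 audio-minutes/day
gpu_min = audio_min / 4                 # 50,000 GPU-min/day at 4x real-time
gpu_hours = gpu_min / 60                # ≈ 833 GPU-hours/day
asr_cost = gpu_hours * 1.50             # ≈ $1,250/day
total = asr_cost * 1.2                  # +0.2x for CPU classifiers/anomalies
print(f"${total:,.0f}/day, ${total / 100_000:.4f}/file")   # ≈ $1,500/day, $0.0150/file
```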
Validation and Guardrails
- Shadow-deploy new models/rules; compare outcomes on 5–10% of traffic; keep a held-out labeled set for offline metrics.
- Monitor precision/recall per category, reviewer disagreement rate, average time-to-decision, and false-positive appeals.
- Implement circuit breakers: if a detector's error or flag rate spikes, auto-disable it or raise its thresholds; alert on anomalies (a minimal breaker sketch follows this list).
- Maintain golden datasets and synthetic test cases for regression; run per-language canaries.
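A minimal circuit-breaker sketch for the guardrail above; the window size, error-rate threshold, and cooldown are illustrative defaults, not recommendations.

```python
import time

class DetectorBreaker:
    """Trips when a detector's recent error/flag rate exceeds a threshold,
    then rejects calls until a cooldown passes. Values are illustrative."""

    def __init__(self, max_error_rate=0.2, window=100, cooldown_s=300):
        self.max_error_rate = max_error_rate
        self.window = window
        self.cooldown_s = cooldown_s
        self.results = []                       # rolling window: True = error
        self.tripped_at = None

    def allow(self) -> bool:
        if self.tripped_at and time.time() - self.tripped_at < self.cooldown_s:
            return False                        # open: skip detector, alert upstream
        self.tripped_at = None                  # cooldown elapsed: half-open again
        return True

    def record(self, error: bool) -> None:
        self.results = (self.results + [error])[-self.window:]
        if len(self.results) == self.window and \
                sum(self.results) / self.window > self.max_error_rate:
            self.tripped_at = time.time()       # trip and start cooldown
```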