Audio Detection System: ML Inference and Data Pipeline (Model Architecture Out of Scope)
Context and Assumptions
Design the machine learning inference and data pipeline for an audio detection system that flags policy-relevant speech and keywords. Assume:
- Near real-time decisions on streaming or short audio chunks, plus offline batch processing for analytics and retraining.
- Multilingual audio, variable noise conditions, and potential background music.
- Model architecture is out of scope; focus on the pipeline, features, calibration, thresholds, data contracts, labeling, drift, and versioning.
Requirements
Describe:
- Feature extraction choices and ordering:
  - Denoising and voice activity detection (VAD).
  - Acoustic features (e.g., spectrograms, MFCCs) and embeddings.
  - Speech-to-text (ASR) and text features, including keyword spotting.
- Inference outputs: how scores are computed, calibrated, and thresholded, and how class imbalance is handled.
- Data contracts for model inputs/outputs, and a storage plan for transcripts, embeddings, and intermediate artifacts.
- Label generation: manual labeling workflows and how labels feed back into training (active learning).
- Drift detection and operational versioning of models and thresholds.
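To illustrate the feature-extraction ordering above (denoise, then VAD, then acoustic features), a minimal energy-based VAD can be sketched in NumPy. The frame length, hop, noise-floor estimate, and threshold ratio are illustrative assumptions; production systems typically use a trained VAD model instead:

```python
import numpy as np

def energy_vad(samples: np.ndarray, frame_len: int = 400, hop: int = 160,
               threshold_ratio: float = 3.0) -> np.ndarray:
    """Return a boolean speech/non-speech flag per frame.

    A frame is flagged as speech when its RMS energy exceeds
    `threshold_ratio` times the estimated noise floor (the 10th
    percentile of frame energies). Purely illustrative.
    """
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    energies = np.array([
        np.sqrt(np.mean(samples[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    noise_floor = np.percentile(energies, 10) + 1e-8
    return energies > threshold_ratio * noise_floor

# Synthetic example at 16 kHz: quiet noise, a loud tone burst, quiet noise.
rng = np.random.default_rng(0)
audio = np.concatenate([
    0.01 * rng.standard_normal(4000),
    0.5 * np.sin(2 * np.pi * 220 * np.arange(4000) / 16000),
    0.01 * rng.standard_normal(4000),
])
flags = energy_vad(audio)
```

Running VAD before acoustic feature extraction means spectrograms, MFCCs, and ASR are computed only on speech frames, which cuts compute and false positives from music-only segments.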
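On the text side, keyword spotting over ASR output is often a normalized-token match against a per-language keyword list. The normalization steps and keyword list below are illustrative; real systems also account for ASR confusions and morphology:

```python
import re
import unicodedata

def normalize(text: str) -> list:
    """Lowercase, strip accents and punctuation, split into tokens."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.findall(r"[a-z0-9]+", text.lower())

def spot_keywords(transcript: str, keywords: set) -> list:
    """Return the keywords present in the transcript, in first-hit order."""
    seen, hits = set(), []
    for tok in normalize(transcript):
        if tok in keywords and tok not in seen:
            seen.add(tok)
            hits.append(tok)
    return hits

hits = spot_keywords("The Café meeting covered the restricted topic.",
                     {"restricted", "cafe", "embargo"})
```

Accent stripping matters for the multilingual requirement: "Café" and "cafe" should hit the same keyword entry regardless of how the ASR system spells it.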
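For the calibration and thresholding requirement, one common pattern is temperature scaling of raw logits on a held-out set, then choosing the operating threshold to meet a target precision (useful under class imbalance, where accuracy is misleading). This sketch uses a grid search over temperatures; the logit distributions, 20% positive rate, and 0.9 precision target are made-up illustrations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 100)):
    """Pick temperature T minimizing negative log-likelihood of
    sigmoid(logits / T) on held-out (logits, labels)."""
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = np.clip(sigmoid(logits / t), 1e-7, 1 - 1e-7)
        nll = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

def threshold_for_precision(scores, labels, target_precision=0.9):
    """Smallest threshold whose held-out precision meets the target."""
    for thr in np.sort(np.unique(scores)):
        preds = scores >= thr
        if preds.sum() == 0:
            break
        precision = (preds & (labels == 1)).sum() / preds.sum()
        if precision >= target_precision:
            return float(thr)
    return 1.0  # target unattainable: never fire

# Made-up held-out set: positives centered at logit +2, negatives at -2.
rng = np.random.default_rng(1)
labels = np.array([1] * 200 + [0] * 800)  # 20% positives (imbalanced)
logits = np.where(labels == 1,
                  rng.normal(2.0, 1.0, 1000), rng.normal(-2.0, 1.0, 1000))
T = fit_temperature(logits, labels)
scores = sigmoid(logits / T)
thr = threshold_for_precision(scores, labels, target_precision=0.9)
```

Keeping the threshold as a separately versioned artifact (rather than baked into the model) lets operations retune precision/recall trade-offs without a model redeploy.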
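The data-contract requirement can be made concrete as a versioned, typed record per scored chunk. The field names, schema version, and validation rules below are illustrative assumptions, not a prescribed schema:

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional

SCHEMA_VERSION = "1.0"  # bump on any breaking field change

@dataclass
class DetectionRecord:
    """Illustrative output contract for one scored audio chunk."""
    schema_version: str
    audio_id: str            # stable id of the source chunk
    model_version: str       # e.g. registry tag of the scoring model
    threshold_version: str   # thresholds versioned separately from models
    start_ms: int
    end_ms: int
    language: str            # BCP-47 code from language identification
    score: float             # calibrated probability in [0, 1]
    decision: bool           # score >= active threshold
    transcript: Optional[str] = None        # present only if ASR ran
    keywords: List[str] = field(default_factory=list)

    def validate(self) -> None:
        assert 0.0 <= self.score <= 1.0, "score must be calibrated to [0, 1]"
        assert self.end_ms > self.start_ms, "empty or negative time span"

record = DetectionRecord(
    schema_version=SCHEMA_VERSION, audio_id="chunk-000123",
    model_version="detector-2024-05", threshold_version="thr-007",
    start_ms=0, end_ms=960, language="en-US",
    score=0.87, decision=True,
    transcript="sample transcript", keywords=["example"],
)
record.validate()
serialized = json.dumps(asdict(record))
```

Recording both model and threshold versions in every output row makes downstream analytics and retraining reproducible: any stored decision can be traced to the exact artifacts that produced it.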
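For the label-feedback loop, the standard active-learning starting point is uncertainty sampling: route the unlabeled items whose calibrated scores sit closest to the decision boundary to human annotators, since those labels reduce model uncertainty the most. A minimal sketch, where the 0.5 boundary and batch size are assumptions:

```python
def select_for_labeling(scored, batch_size=3, boundary=0.5):
    """Pick the `batch_size` items whose calibrated scores are nearest
    the decision boundary.

    `scored` is a list of (item_id, calibrated_score) pairs; returns
    the selected item ids, most uncertain first.
    """
    ranked = sorted(scored, key=lambda pair: abs(pair[1] - boundary))
    return [item_id for item_id, _ in ranked[:batch_size]]

# Illustrative unlabeled queue: high-confidence items are skipped,
# near-boundary items go to annotators.
queue = [("a", 0.97), ("b", 0.51), ("c", 0.12), ("d", 0.48), ("e", 0.55)]
to_label = select_for_labeling(queue)
```

The resulting human labels are appended to the training store (under the same data contract), and periodic retraining closes the loop.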
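Drift detection over score distributions is often operationalized with the Population Stability Index (PSI) between a training-time reference histogram and a live window. The bin count and the conventional 0.1/0.2 alert levels below are common rules of thumb, not requirements from this document:

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between two score samples in [0, 1].

    PSI = sum over bins of (p_live - p_ref) * ln(p_live / p_ref).
    Rule of thumb: < 0.1 stable, 0.1-0.2 watch, > 0.2 investigate/retrain.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    eps = 1e-6  # keep log well-defined for empty bins
    p_ref = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    p_live = np.histogram(live, bins=edges)[0] / len(live) + eps
    return float(np.sum((p_live - p_ref) * np.log(p_live / p_ref)))

# Illustrative score distributions: stable window vs. drifted window.
rng = np.random.default_rng(2)
ref_scores = rng.beta(2, 5, 5000)   # training-time score distribution
stable = rng.beta(2, 5, 5000)       # live window, same distribution
shifted = rng.beta(5, 2, 5000)      # live window after drift
psi_stable = psi(ref_scores, stable)
psi_shifted = psi(ref_scores, shifted)
```

When PSI crosses the alert level, the usual operational response is to sample the live window for labeling (feeding the active-learning queue), recalibrate or retune thresholds, and, if drift persists, trigger retraining, with each new model and threshold recorded under a new version.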