System Design: Risk Management Ticketing System
Context
Design an internal ticketing system for risk/security/compliance issues. End users and teams file and triage tickets; an automated scanner bot also files tickets for detected risks. The system must support operational workflows, SLAs, and monthly reporting across teams.
Assume the following scale to guide design choices:
- 20k active users, 1k bot findings/minute peak, 50–100 QPS on reads, 5–20 QPS writes.
- 10–50 million tickets lifetime, with heavy filtering/search by status, assignee, team, and keyword.
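The scale figures above can be sanity-checked with a little arithmetic. The sketch below converts the peak bot rate to writes per second and estimates raw ticket storage. The `bytes_per_ticket` value and the function name `capacity_estimate` are illustrative assumptions, not figures from the prompt.

```python
# Back-of-envelope capacity check for the stated scale assumptions.

def capacity_estimate(bytes_per_ticket: int = 2048) -> dict:
    """Derive rough numbers from the prompt's scale figures.

    bytes_per_ticket is an illustrative assumption; the prompt does not
    state an average ticket size.
    """
    bot_findings_per_min_peak = 1_000
    write_qps_low, write_qps_high = 5, 20
    tickets_low, tickets_high = 10_000_000, 50_000_000

    bot_writes_per_sec = bot_findings_per_min_peak / 60
    return {
        "bot_writes_per_sec": round(bot_writes_per_sec, 1),
        "bot_fits_in_write_budget": write_qps_low <= bot_writes_per_sec <= write_qps_high,
        "storage_gb_low": tickets_low * bytes_per_ticket / 1e9,
        "storage_gb_high": tickets_high * bytes_per_ticket / 1e9,
    }


if __name__ == "__main__":
    for name, value in capacity_estimate().items():
        print(f"{name}: {value}")
```

Under these assumptions the peak bot rate works out to roughly 16.7 writes per second, which sits near the top of the 5–20 QPS write range. One reading is that bot submissions account for most write traffic at peak.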
Requirements
- Authentication and authorization (RBAC: user, triage, admin, bot).
- Ticket schema: states, priority, assignee, comments, attachments, audit trail of all changes.
- Workflows and SLAs (e.g., time to first response/triage, time to resolve), pause rules.
- Idempotent bot submissions (no duplicate tickets for the same finding window).
- Notifications (email/chat/webhooks) on state/assignee/priority changes and SLA breaches.
- Monthly reporting: counts and SLA metrics by team.
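The idempotency requirement for bot submissions could be met by deriving a deterministic deduplication key from the finding and its time window. The sketch below is one possible approach under stated assumptions: the field names `rule_id` and `resource_id`, the 24-hour fixed window, and the in-memory store (standing in for a database unique constraint) are all hypothetical.

```python
import hashlib

WINDOW_SECONDS = 24 * 3600  # assumption: the prompt does not define the window length


def dedup_key(rule_id: str, resource_id: str, observed_at: int,
              window_seconds: int = WINDOW_SECONDS) -> str:
    """Hash the finding identity together with its fixed time bucket."""
    bucket = observed_at // window_seconds
    raw = f"{rule_id}\x1f{resource_id}\x1f{bucket}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryTicketStore:
    """Stand-in for a table with a unique index on the dedup key."""

    def __init__(self) -> None:
        self._ticket_by_key: dict[str, int] = {}
        self._next_id = 1

    def submit_finding(self, rule_id: str, resource_id: str,
                       observed_at: int) -> tuple[int, bool]:
        """Return (ticket_id, created). Replays return the existing ticket."""
        key = dedup_key(rule_id, resource_id, observed_at)
        existing = self._ticket_by_key.get(key)
        if existing is not None:
            return existing, False
        ticket_id = self._next_id
        self._next_id += 1
        self._ticket_by_key[key] = ticket_id
        return ticket_id, True


if __name__ == "__main__":
    store = InMemoryTicketStore()
    print(store.submit_finding("open-port", "host-1", 1_000))   # first submission
    print(store.submit_finding("open-port", "host-1", 2_000))   # same window, replay
    print(store.submit_finding("open-port", "host-1", 90_000))  # next window
```

Fixed buckets keep the key stateless, at the cost that two observations straddling a bucket boundary produce two tickets. A sliding window would avoid that but needs a lookup of the most recent ticket for the same finding.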
Provide:
- API endpoints (CRUD, search, comments, audit, reports).
- Storage choices and indexing strategy.
- Consistency model.
- Background jobs for SLA tracking and report generation.
- Scaling strategy.
- Failure handling, observability, GDPR/retention considerations.
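For the SLA tracking job, one way to honor pause rules is to replay a ticket's state changes from the audit trail and count only the time spent in states where the clock runs. The sketch below assumes hypothetical state names (`open`, `in_progress`, `waiting_on_reporter`, `resolved`) and epoch-second timestamps. The prompt does not specify which states pause the clock.

```python
from dataclasses import dataclass

# Assumption: states in which the SLA clock does not run.
CLOCK_STOPPED_STATES = frozenset({"waiting_on_reporter", "resolved"})


@dataclass(frozen=True)
class StateChange:
    at: int      # epoch seconds when the ticket entered `state`
    state: str


def sla_elapsed_seconds(created_at: int, changes: list[StateChange], now: int,
                        initial_state: str = "open",
                        stopped_states: frozenset = CLOCK_STOPPED_STATES) -> int:
    """Sum the time spent in clock-running states between created_at and now."""
    elapsed = 0
    cursor = created_at
    state = initial_state
    for change in sorted(changes, key=lambda c: c.at):
        if change.at > now:
            break  # ignore changes recorded after the evaluation time
        if state not in stopped_states:
            elapsed += change.at - cursor
        cursor = change.at
        state = change.state
    if state not in stopped_states:
        elapsed += now - cursor
    return elapsed


def is_breached(created_at: int, changes: list[StateChange], now: int,
                target_seconds: int) -> bool:
    """A ticket breaches once its running time strictly exceeds the target."""
    return sla_elapsed_seconds(created_at, changes, now) > target_seconds


if __name__ == "__main__":
    history = [StateChange(100, "waiting_on_reporter"), StateChange(400, "in_progress")]
    print(sla_elapsed_seconds(0, history, now=600))
    print(is_breached(0, history, now=600, target_seconds=250))
```

A periodic job could run this check over open tickets and emit a breach notification when `is_breached` first returns true. Deriving elapsed time from the audit trail rather than a mutable counter keeps the calculation reproducible for monthly reports.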