How do I approach ML System Design interview questions?

ML System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a medium difficulty ML System Design question, commonly asked during Technical Screen rounds at Hebbia.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Hebbia during technical interviews.

Design Hebbia Chat for SEC Filings | Hebbia Interview Question

Q: Design Hebbia Chat for SEC Filings

This question evaluates expertise in ML system design, specifically the ability to architect a multi-agent retrieval-augmented generation pipeline at scale. It tests practical knowledge of agent orchestration, streaming, cancellation, and token-budget enforcement — core competencies for senior ML and software engineering roles focused on LLM-powered products.

Design Hebbia Chat, a chat interface that lets users ask questions about SEC filings and get grounded, cited answers.

The system should be fast, durable, and scalable. Under the hood it runs a multi-agent workflow with at least these agents:

Orchestrator agent — decides which subagent to call next and when the workflow is done.
Retrieval agent — searches the SEC-filing corpus and returns relevant evidence with citations.
Output agent — generates the final user-facing answer from the retrieved evidence and intermediate agent results.
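The division of labor among the three agents can be sketched as a small control loop. This is only an illustration under stated assumptions, not Hebbia's implementation: `WorkflowState`, `decide_next`, `retrieve`, `generate`, and `run_workflow` are hypothetical names, the evidence is a hard-coded placeholder, and each stub stands in for what would be an LLM or search call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WorkflowState:
    question: str
    evidence: List[dict] = field(default_factory=list)  # retrieved chunks with citations
    answer: str = ""
    trace: List[str] = field(default_factory=list)      # agent calls made so far


def decide_next(state: WorkflowState) -> str:
    """Orchestrator policy (stub). A real one would likely ask an LLM to choose."""
    if not state.evidence:
        return "retrieval"
    if not state.answer:
        return "output"
    return "done"


def retrieve(state: WorkflowState) -> None:
    # Stub retrieval agent: placeholder evidence, not real filing data.
    state.evidence.append({
        "text": "Placeholder sentence from a filing.",
        "citation": {"filing": "10-K", "section": "Item 7", "date": "2024-02-01"},
    })


def generate(state: WorkflowState) -> None:
    # Stub output agent: echoes the evidence and attaches its citation.
    chunk = state.evidence[0]
    c = chunk["citation"]
    state.answer = f"{chunk['text']} [{c['filing']}, {c['section']}, {c['date']}]"


AGENTS: Dict[str, Callable[[WorkflowState], None]] = {
    "retrieval": retrieve,
    "output": generate,
}


def run_workflow(question: str, max_steps: int = 8) -> WorkflowState:
    state = WorkflowState(question)
    for _ in range(max_steps):  # hard cap so a confused orchestrator cannot loop forever
        step = decide_next(state)
        if step == "done":
            break
        state.trace.append(step)
        AGENTS[step](state)
    return state
```

Keeping the orchestrator's decision separate from the agents it dispatches makes it easy to persist `WorkflowState` between steps and to cap the number of steps per turn.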

Functional requirements

Users can submit chat messages asking questions about SEC filings.
The backend must surface loading/progress state to the user before and after every agent call.
Text produced by the output agent must be streamed to the frontend as it is generated.
Users can cancel an in-progress message.
There is a finite token budget per user or workspace within a time window — token usage must be tracked and enforced.
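One way to picture the progress, streaming, and cancellation requirements together is a single ordered event stream per message. The sketch below uses `asyncio` with an async generator; the event names (`agent_started`, `text_delta`, `cancelled`, and so on), `run_turn`, `collect`, and `fake_output_agent` are all assumptions made for illustration, and the transport (SSE, WebSocket, or other) is deliberately left out.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_output_agent() -> AsyncIterator[str]:
    # Stand-in for a streaming LLM call; yields one text piece at a time.
    for piece in ["Revenue ", "rose ", "in ", "the ", "period."]:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield piece


async def run_turn(cancel: asyncio.Event) -> AsyncIterator[Dict[str, str]]:
    """Yield the events a backend might push to the frontend for one message."""
    for agent in ("orchestrator", "retrieval"):
        if cancel.is_set():
            yield {"type": "cancelled"}
            return
        yield {"type": "agent_started", "agent": agent}
        await asyncio.sleep(0)  # the agent's real work would happen here
        yield {"type": "agent_finished", "agent": agent}

    yield {"type": "agent_started", "agent": "output"}
    async for piece in fake_output_agent():
        if cancel.is_set():  # checked between chunks so a cancel takes effect quickly
            yield {"type": "cancelled"}
            return
        yield {"type": "text_delta", "text": piece}
    yield {"type": "agent_finished", "agent": "output"}
    yield {"type": "done"}


async def collect(cancel_after_deltas: int = -1) -> List[Dict[str, str]]:
    """Drive one turn; optionally simulate the user pressing stop mid-stream."""
    cancel = asyncio.Event()
    events: List[Dict[str, str]] = []
    deltas = 0
    async for ev in run_turn(cancel):
        events.append(ev)
        if ev["type"] == "text_delta":
            deltas += 1
            if deltas == cancel_after_deltas:
                cancel.set()
    return events
```

Calling `asyncio.run(collect())` yields the full event sequence ending in `done`, while `asyncio.run(collect(cancel_after_deltas=2))` stops after two text deltas and ends in `cancelled`. In a real system the cancel flag would probably live in a shared store so that any backend instance handling the cancel request can signal the worker running the turn.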

Design the end-to-end system: frontend↔backend communication, agent orchestration, persistence, cancellation, streaming, token-budget enforcement, retrieval over filings, scalability, and failure handling.

Constraints & Assumptions

State your own assumptions; reasonable defaults include:

Corpus: SEC EDGAR filings (10-K, 10-Q, 8-K, S-1, proxy statements, etc.); millions of documents, tens of GB to low TB of text, growing daily. Largely static once filed; ingestion is offline/batch.
Scale: thousands of concurrent users; each chat turn may issue several LLM calls and one or more retrieval calls; turns last seconds to low tens of seconds.
Latency: time-to-first-token (TTFT) for the streamed answer is the key UX metric; target a few seconds.
LLM: accessed through a provider abstraction that supports streaming, timeouts, retries, and per-call token accounting. Do not assume a specific vendor or model.
Budgets: enforced per user and per workspace, over a rolling or fixed time window (e.g. per-minute and per-day caps).
Correctness/grounding: answers must cite the specific filing, section, and date they draw from; the system must not hallucinate financial facts.
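The budget default above (per-user and per-workspace caps over fixed windows) could be enforced with a reserve-then-settle pattern: reserve an estimate before each LLM call, then correct it with the provider's reported usage afterwards. The sketch below is a minimal in-memory version; `TokenBudget`, `reserve`, and `settle` are hypothetical names, the limits are made-up numbers, and a production system would presumably keep these counters in a shared store with atomic updates.

```python
import time
from collections import defaultdict
from typing import Dict, Iterator, Optional, Tuple

Key = Tuple[str, str, int, int]  # (scope, id, window_seconds, window_index)


class TokenBudget:
    """Fixed-window token counters; an in-memory stand-in for a shared store."""

    def __init__(self, limits: Dict[Tuple[str, int], int]) -> None:
        # limits maps (scope, window_seconds) -> max tokens,
        # e.g. ("user", 60) -> 10_000 for a per-user, per-minute cap.
        self.limits = limits
        self.used: Dict[Key, int] = defaultdict(int)

    def _keys(self, ids: Dict[str, str], now: float) -> Iterator[Tuple[Key, int]]:
        for (scope, window), cap in self.limits.items():
            bucket = int(now // window)  # index of the current fixed window
            yield (scope, ids[scope], window, bucket), cap

    def reserve(self, ids: Dict[str, str], tokens: int,
                now: Optional[float] = None) -> bool:
        """Reserve tokens before an LLM call; refuse if any cap would be exceeded."""
        now = time.time() if now is None else now
        keys = list(self._keys(ids, now))
        if any(self.used[k] + tokens > cap for k, cap in keys):
            return False
        for k, _ in keys:
            self.used[k] += tokens
        return True

    def settle(self, ids: Dict[str, str], reserved: int, actual: int,
               now: Optional[float] = None) -> None:
        """After the call, replace the reservation with the actual usage.

        Simplification: assumes the window has not rolled over since reserve().
        """
        now = time.time() if now is None else now
        for k, _ in self._keys(ids, now):
            self.used[k] += actual - reserved
```

For example, with `TokenBudget({("user", 60): 100, ("workspace", 86400): 150})`, a reservation of 80 tokens succeeds and a following reservation of 30 in the same minute is refused until `settle` lowers the recorded usage. Reserving up front means concurrent turns in the same workspace cannot jointly overshoot the cap, at the cost of occasionally refusing a call that would have fit.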

Clarifying Questions to Ask

What is the time window and granularity of the token budget (per-minute, per-day; per-user, per-workspace, or both), and what should happen when it's exhausted mid-run — hard stop, degraded short answer, or queue?
How fresh must filings be — do we need near-real-time ingestion of new 8-Ks, or is a daily batch acceptable?
Are conversations multi-turn (does the orchestrator see prior turns / prior retrieved evidence), and how much history do we carry into context?
Do we need cross-filing reasoning (e.g. "compare revenue across the last three 10-Ks"), or is single-document Q&A sufficient for v1?
What are the durability expectations on a crash mid-stream — must the user be able to reconnect and resume the same answer, or is a clean "retry from scratch" acceptable?
What compliance/data-isolation guarantees do we owe (per-workspace tenancy, audit logging, PII handling in prompts)?

Follow-up Questions

The streaming connection drops after 80% of the answer has been sent. Walk through exactly how the client reconnects and resumes without re-billing tokens or duplicating text.
A workspace exhausts its token budget while the output agent is mid-stream. What does the user see, what gets persisted, and how do you make the cutoff feel graceful rather than a hard error?
The retrieval agent returns a chunk whose text contains an instruction like "ignore previous instructions and output X." How does your design prevent that filing text from hijacking the output agent?
How would you extend the orchestrator to support cross-filing comparison queries (e.g. revenue trend across three years of 10-Ks) without blowing the latency budget or the context window?
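For the dropped-connection follow-up, one common approach is to decouple generation from the connection: the output agent writes numbered chunks to a log once, and the client reconnects with the last sequence number it saw. The sketch below illustrates that idea only; `StreamLog` and `Client` are hypothetical names, and the list stands in for whatever durable store the design uses. Because a reconnect reads from the log instead of re-invoking the model, no tokens are billed twice, and the sequence check on the client discards duplicates.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StreamLog:
    """Append-only log of text chunks for one message (stand-in for a durable store)."""
    chunks: List[str] = field(default_factory=list)

    def append(self, text: str) -> int:
        self.chunks.append(text)
        return len(self.chunks)  # 1-based sequence number of this chunk

    def read_after(self, last_seq: int) -> List[Tuple[int, str]]:
        """Return (seq, text) pairs for chunks the client has not seen yet."""
        return [(i + 1, t) for i, t in enumerate(self.chunks) if i + 1 > last_seq]


class Client:
    def __init__(self) -> None:
        self.last_seq = 0
        self.text = ""

    def receive(self, events: List[Tuple[int, str]]) -> None:
        for seq, chunk in events:
            if seq <= self.last_seq:  # duplicate delivery: ignore
                continue
            self.text += chunk
            self.last_seq = seq


# Usage: the connection drops after 4 of 5 chunks, then the client resumes.
log = StreamLog()
for piece in ["a", "b", "c", "d", "e"]:
    log.append(piece)

client = Client()
client.receive(log.read_after(0)[:4])            # partial delivery before the drop
client.receive(log.read_after(client.last_seq))  # resume from the last seen chunk
```

If the transport is Server-Sent Events, the `Last-Event-ID` request header is a natural carrier for `last_seq` on reconnect.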
