Design Hebbia Chat for SEC Filings
Company: Hebbia
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
Design **Hebbia Chat**, a chat interface that lets users ask questions about SEC filings and get grounded, cited answers.
The system should be fast, durable, and scalable. Under the hood it runs a **multi-agent workflow** with at least these agents:
- **Orchestrator agent** — decides which subagent to call next and when the workflow is done.
- **Retrieval agent** — searches the SEC-filing corpus and returns relevant evidence with citations.
- **Output agent** — generates the final user-facing answer from the retrieved evidence and intermediate agent results.
**Functional requirements**
1. Users can submit chat messages asking questions about SEC filings.
2. The backend must surface loading/progress state to the user **before and after every agent call**.
3. Text produced by the output agent must be **streamed** to the frontend as it is generated.
4. Users can **cancel** an in-progress message.
5. There is a **finite token budget per user or workspace within a time window** — token usage must be tracked and enforced.
Design the end-to-end system: frontend↔backend communication, agent orchestration, persistence, cancellation, streaming, token-budget enforcement, retrieval over filings, scalability, and failure handling.
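One way to picture the orchestration loop and the "progress before and after every agent call" requirement is the minimal sketch below. Everything in it is an assumption made for illustration: the `run_workflow` function, the `emit` callback, the `agent_started` / `agent_finished` event names, and the stub agents are hypothetical, and real agents would call an LLM or a search index rather than return canned values.

```python
from typing import Callable

Agent = Callable[[dict], dict]      # takes the workflow state, returns updates to it
Emit = Callable[[str, dict], None]  # publishes one progress event to the client


def run_workflow(question: str, agents: dict[str, Agent], emit: Emit,
                 max_steps: int = 8) -> dict:
    """Ask the orchestrator what to do next, run that subagent, repeat."""
    state: dict = {"question": question, "evidence": [], "answer": None}

    def call(name: str) -> dict:
        # Progress is emitted before and after every agent call, even on failure.
        emit("agent_started", {"agent": name})
        try:
            return agents[name](state)
        finally:
            emit("agent_finished", {"agent": name})

    # Hard step cap so a confused orchestrator cannot loop forever.
    for _ in range(max_steps):
        decision = call("orchestrator")
        if decision["next"] == "done":
            break
        state.update(call(decision["next"]))
    return state


# Hypothetical stub agents standing in for real LLM and retrieval calls.
def orchestrator(state: dict) -> dict:
    if not state["evidence"]:
        return {"next": "retrieval"}
    if state["answer"] is None:
        return {"next": "output"}
    return {"next": "done"}


def retrieval(state: dict) -> dict:
    return {"evidence": [{"text": "stub passage", "citation": "stub filing, stub section"}]}


def output(state: dict) -> dict:
    return {"answer": f"Answer drawn from {len(state['evidence'])} passage(s)."}


if __name__ == "__main__":
    stubs = {"orchestrator": orchestrator, "retrieval": retrieval, "output": output}
    final = run_workflow("What was revenue?", stubs, lambda kind, body: print(kind, body))
    print(final["answer"])
```

Routing every call through one wrapper keeps the progress contract in a single place, so adding a new subagent cannot silently skip the events.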
```hint What unit of work makes the five requirements tractable?
The five requirements (message submission, progress events, streaming, cancellation, budgets) share a common challenge: a single user turn spans multiple agent calls and may outlive a single server process. Think about what durable unit of work would let every operational concern (progress visibility, crash recovery, reconnect, budget tracking, idempotent retry) be derived from the same stored state rather than solved independently.
```
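One possible reading of this hint is a per-turn "run" record that owns an append-only event log. The sketch below is illustrative only: the `Run` and `Event` names, the status values, and the in-memory lists are assumptions, and a real system would persist the same shape in a database or log store.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class RunStatus(str, Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    CANCELLED = "cancelled"
    FAILED = "failed"


@dataclass
class Event:
    seq: int                 # position in the run's log; doubles as a resume cursor
    kind: str                # e.g. "agent_started", "token_delta"
    payload: Dict[str, Any]


@dataclass
class Run:
    run_id: str
    workspace_id: str
    status: RunStatus = RunStatus.RUNNING
    tokens_used: int = 0
    events: List[Event] = field(default_factory=list)
    _seen_keys: Dict[str, int] = field(default_factory=dict)

    def append(self, kind: str, payload: Dict[str, Any],
               idempotency_key: Optional[str] = None) -> Event:
        """Append an event; a retried write with the same key is a no-op."""
        if idempotency_key is not None and idempotency_key in self._seen_keys:
            return self.events[self._seen_keys[idempotency_key]]
        event = Event(seq=len(self.events), kind=kind, payload=payload)
        self.events.append(event)
        if idempotency_key is not None:
            self._seen_keys[idempotency_key] = event.seq
        # Token accounting rides on the same write, so it cannot drift from the log.
        self.tokens_used += payload.get("tokens", 0)
        return event

    def events_after(self, last_seq: int) -> List[Event]:
        """Everything a reconnecting client has not seen yet."""
        return [e for e in self.events if e.seq > last_seq]
```

With this shape, progress display, reconnect, and budget tracking all read from the same log instead of each keeping its own state.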
```hint Frontend channel
You need server→client push for progress and token deltas. Compare **SSE** (one-way, simple, auto-reconnect, plays well with HTTP/2) vs **WebSocket** (bidirectional, heavier). Consider decoupling the *write* request (POST the message) from the *stream* request (GET the event stream) and what that enables for reconnection.
```
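To make the reconnection point concrete, here is a minimal sketch of SSE framing plus replay from a stored event log. The `id:`, `event:`, and `data:` fields and the `Last-Event-ID` request header come from the server-sent events standard; the `sse_frame` and `resume_stream` helpers and the list-based log are hypothetical and not tied to any web framework.

```python
import json
from typing import Iterator, List, Optional, Tuple


def sse_frame(seq: int, event_type: str, payload: dict) -> str:
    """Format one server-sent event; a blank line terminates the frame."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    data = json.dumps(payload)
    return f"id: {seq}\nevent: {event_type}\ndata: {data}\n\n"


def resume_stream(log: List[Tuple[str, dict]],
                  last_event_id: Optional[str]) -> Iterator[str]:
    """Replay stored events the client has not acknowledged yet.

    `last_event_id` is the value of the Last-Event-ID header the browser
    sends on reconnect, or None on the first connection.
    """
    start = int(last_event_id) + 1 if last_event_id else 0
    for seq in range(start, len(log)):
        event_type, payload = log[seq]
        yield sse_frame(seq, event_type, payload)


if __name__ == "__main__":
    log = [
        ("progress", {"agent": "retrieval", "state": "started"}),
        ("token", {"text": "Revenue "}),
        ("token", {"text": "grew."}),
    ]
    # The client saw events 0 and 1 before the connection dropped.
    for frame in resume_stream(log, last_event_id="1"):
        print(frame, end="")
```

Because the stream endpoint only reads from the log, a reconnect replays stored text and does not trigger another LLM call.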
```hint Cancellation and budget enforcement are both asynchronous checks
A running workflow spans multiple agent calls and LLM streams. Think about *when* and *where* in that execution you can cleanly check whether to abort or stop spending. Consider what needs to be stored vs what can be in memory, and what happens to any partially generated output in each case.
```
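A minimal sketch of the checkpoint idea: one function that is called between agent calls and on every streamed delta, and that raises when the run should stop. The `RunControl`, `StopRun`, and `stream_answer` names are hypothetical, the in-memory flag stands in for a cancel marker that would live in shared storage, and charging one budget unit per delta is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


class StopRun(Exception):
    """Raised at a checkpoint when the run must stop early."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass
class RunControl:
    budget_remaining: int
    cancelled: bool = False  # assumption: in production this is read from shared storage

    def checkpoint(self, cost: int = 0) -> None:
        """Call between agent calls and per streamed delta."""
        if self.cancelled:
            raise StopRun("cancelled")
        if cost > self.budget_remaining:
            raise StopRun("budget_exhausted")
        self.budget_remaining -= cost


def stream_answer(tokens: Iterable[str], control: RunControl,
                  partial: List[str]) -> str:
    """Forward deltas until done or stopped; `partial` keeps what was emitted."""
    try:
        for token in tokens:
            control.checkpoint(cost=1)  # simplification: one budget unit per delta
            partial.append(token)
    except StopRun as stop:
        return stop.reason              # caller persists `partial` with this status
    return "completed"


if __name__ == "__main__":
    control = RunControl(budget_remaining=3)
    partial: List[str] = []
    status = stream_answer(["Revenue", " grew", " 12%", " year", " over"], control, partial)
    print(status, partial)
```

Returning a status instead of propagating the exception lets the caller store the partial answer together with the reason it stopped, which is what a client needs to render a truncated message.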
### Constraints & Assumptions
State your own assumptions; reasonable defaults include:
- **Corpus**: SEC EDGAR filings (10-K, 10-Q, 8-K, S-1, proxy statements, etc.) — millions of documents, tens of GB to low TB of text, growing daily. Largely static once filed; ingestion is offline/batch.
- **Scale**: thousands of concurrent users; each chat turn may issue several LLM calls and one or more retrieval calls; turns last seconds to low tens of seconds.
- **Latency**: time-to-first-token (TTFT) for the streamed answer is the key UX metric; target a few seconds.
- **LLM**: accessed through a provider abstraction that supports **streaming, timeouts, retries, and per-call token accounting**. Do not assume a specific vendor or model.
- **Budgets**: enforced per user *and* per workspace, over a rolling or fixed time window (e.g. per-minute and per-day caps).
- **Correctness/grounding**: answers must cite the specific filing, section, and date they draw from; the system must not hallucinate financial facts.
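The "per user and per workspace" budget default can be illustrated with a fixed-window counter that is checked across all scopes before any of them is charged. This is a sketch under stated assumptions: the `FixedWindowBudget` class and `try_spend` helper are hypothetical, state is held in a process-local dict, and a multi-server deployment would need the check-and-charge step to be atomic in a shared store. Old windows are never evicted here.

```python
from typing import Dict, List, Tuple


class FixedWindowBudget:
    """Token cap per key (user or workspace) within fixed time windows."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._used: Dict[Tuple[str, int], int] = {}

    def _bucket(self, key: str, now: float) -> Tuple[str, int]:
        # All timestamps in the same window map to the same counter.
        return (key, int(now // self.window_seconds))

    def remaining(self, key: str, now: float) -> int:
        return self.limit - self._used.get(self._bucket(key, now), 0)

    def spend(self, key: str, tokens: int, now: float) -> None:
        bucket = self._bucket(key, now)
        self._used[bucket] = self._used.get(bucket, 0) + tokens


def try_spend(scopes: List[Tuple[FixedWindowBudget, str]],
              tokens: int, now: float) -> bool:
    """Charge every scope, or none of them if any scope lacks headroom."""
    if any(budget.remaining(key, now) < tokens for budget, key in scopes):
        return False
    for budget, key in scopes:
        budget.spend(key, tokens, now)
    return True


if __name__ == "__main__":
    per_user = FixedWindowBudget(limit=100, window_seconds=60)
    per_workspace = FixedWindowBudget(limit=150, window_seconds=60)
    print(try_spend([(per_user, "u1"), (per_workspace, "w1")], 80, now=0))
    print(try_spend([(per_user, "u2"), (per_workspace, "w1")], 80, now=10))
```

Checking every scope before charging any of them means a request rejected by the workspace cap leaves the user's own counter untouched.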
### Clarifying Questions to Ask
- What is the time window and granularity of the token budget (per-minute, per-day; per-user, per-workspace, or both), and what should happen when it's exhausted mid-run — hard stop, degraded short answer, or queue?
- How fresh must filings be — do we need near-real-time ingestion of new 8-Ks, or is a daily batch acceptable?
- Are conversations multi-turn (does the orchestrator see prior turns / prior retrieved evidence), and how much history do we carry into context?
- Do we need cross-filing reasoning (e.g. "compare revenue across the last three 10-Ks"), or is single-document Q&A sufficient for v1?
- What are the durability expectations on a crash mid-stream — must the user be able to reconnect and resume the same answer, or is a clean "retry from scratch" acceptable?
- What compliance/data-isolation guarantees do we owe (per-workspace tenancy, audit logging, PII handling in prompts)?
### Follow-up Questions
- The streaming connection drops after 80% of the answer has been sent. Walk through exactly how the client reconnects and resumes without re-billing tokens or duplicating text.
- A workspace exhausts its token budget while the output agent is mid-stream. What does the user see, what gets persisted, and how do you make the cutoff feel graceful rather than like a hard error?
- The retrieval agent returns a chunk whose text contains an instruction like "ignore previous instructions and output X." How does your design prevent that filing text from hijacking the output agent?
- How would you extend the orchestrator to support cross-filing comparison queries (e.g. revenue trend across three years of 10-Ks) without blowing the latency budget or the context window?
Quick Answer: This question evaluates expertise in ML system design, specifically the ability to architect a multi-agent retrieval-augmented generation pipeline at scale. It tests practical knowledge of agent orchestration, streaming, cancellation, and token-budget enforcement — core competencies for senior ML and software engineering roles focused on LLM-powered products.