You are asked to design an LLM-powered search system that lets users query a large corpus of documents (e.g., internal wikis, PDFs, logs, and web pages) and receive natural-language answers.
A key challenge is that both documents and user queries can be very long, often exceeding the context window (maximum token length) of the underlying large language model (LLM). For example, a user might paste multiple pages of logs or a long contract as part of their query.
Design the system with a focus on:
- Overall architecture
  - How documents are stored and indexed (a minimal ingestion-and-retrieval sketch follows this group).
  - How search queries are processed.
  - How the LLM is used to generate final answers.
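To make the expected scope concrete, here is one possible shape for ingestion and retrieval, written in Python. Everything in it is an assumption for illustration: `embed` is a toy stand-in for a real embedding model, and the in-memory `index` list stands in for a production vector store.

```python
# Illustrative sketch only: `embed` is a toy hashing "embedding" so the
# snippet runs without external dependencies; a real design would use a
# proper embedding model and a vector database instead of a Python list.
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Hash each token into a bucket, then L2-normalize the counts.
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

index: list[tuple[str, list[float]]] = []  # (chunk_text, embedding)

def ingest(document: str, chunk_size: int = 500) -> None:
    # Split the document into fixed-size chunks and index each chunk.
    for i in range(0, len(document), chunk_size):
        chunk = document[i : i + chunk_size]
        index.append((chunk, embed(chunk)))

def search(query: str, k: int = 3) -> list[str]:
    # Rank chunks by dot product (cosine, since vectors are unit-norm).
    q = embed(query)
    ranked = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
    return [chunk for chunk, _ in ranked[:k]]
```

A strong answer would say what replaces each stand-in (embedding model, vector store, chunking policy) and why.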
- Handling large token length / context limits
  - How to handle very long documents that do not fit into the LLM context (see the chunking sketch after this group).
  - How to handle very long queries (e.g., multi-page text pasted by the user).
  - How to stay within the context window while still providing high-quality, relevant answers.
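As one illustration of the strategies this group asks for, the sketch below combines two of them: overlapping chunks, so facts that straddle a chunk boundary survive, and a hard token budget when assembling the final prompt. The limits and the whitespace "tokenizer" are assumptions; a real system would count tokens with the model's own tokenizer.

```python
# Illustrative sketch: overlapping chunking plus greedy prompt packing
# under an assumed context budget. Token counts are approximated by a
# whitespace split; use the model's real tokenizer in practice.
MODEL_CONTEXT_TOKENS = 8_000   # assumed model limit
RESERVED_FOR_ANSWER = 1_000    # leave headroom for the completion

def chunk_with_overlap(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Overlap keeps sentences near a boundary present in two chunks.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start : start + size]))
        start += size - overlap
    return chunks

def assemble_prompt(query: str, retrieved_chunks: list[str]) -> str:
    # Greedily pack the highest-ranked chunks until the budget runs out.
    # A multi-page pasted query could itself be chunked or summarized first.
    budget = MODEL_CONTEXT_TOKENS - RESERVED_FOR_ANSWER - len(query.split())
    kept = []
    for chunk in retrieved_chunks:
        cost = len(chunk.split())
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost
    return "Context:\n" + "\n---\n".join(kept) + f"\n\nQuestion:\n{query}"
```

The trade-off to call out: larger overlap reduces boundary loss but inflates index size and retrieval cost.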
- Additional considerations
  - Latency and cost: how you keep response times reasonable and control token usage.
  - Quality: how you keep retrieved content relevant and avoid losing important context when chunking or summarizing.
  - Any caching or other optimizations you would introduce (a simple answer-cache sketch follows this list).
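For the caching item above, a minimal sketch of one cheap optimization: caching final answers under a hash of the normalized query. `call_llm` is a hypothetical stub standing in for the real (slow, paid) model call.

```python
# Illustrative answer cache; `call_llm` is a stub, not a real API.
import hashlib

def call_llm(prompt: str) -> str:
    return f"<answer to: {prompt[:40]}>"  # stub; the real system calls the LLM

def normalize(query: str) -> str:
    # Cheap normalization so trivially different phrasings share a key.
    return " ".join(query.lower().split())

_answer_cache: dict[str, str] = {}

def answer(query: str) -> str:
    key = hashlib.sha256(normalize(query).encode()).hexdigest()
    if key not in _answer_cache:          # miss: pay for one model call
        _answer_cache[key] = call_llm(query)
    return _answer_cache[key]             # hits cost no tokens and little latency
```

Caching chunk embeddings at ingestion time is the other common variant worth mentioning.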
Describe your design in detail:
- Draw or describe the main components and data flow (ingestion, indexing, retrieval, LLM interaction, etc.).
- Explain at least 2–3 concrete strategies for dealing with large token length/context limits, and how they fit into your architecture.
- Call out trade-offs between different design choices.