Design and Implement a Minimal LLM-Powered RAG Agent (Python, Mistral API)
Context
You are asked to build a minimal but production-minded retrieval-augmented generation (RAG) agent in Python that uses the Mistral API for embeddings and chat completions. The agent should work over a small local corpus (e.g., Markdown and PDF files), support streaming responses, maintain a short conversation memory, and include basic reliability features.
Assume Python 3.10+ and that external dependencies can be installed. The Mistral API token is provided via an environment variable.
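For orientation, a minimal setup sketch, assuming the `mistralai` 1.x Python SDK and an environment variable named `MISTRAL_API_KEY` (the variable name is not prescribed by this task, and client/method names may differ in other SDK versions):

```python
import os

from mistralai import Mistral

# Read the token from the environment and fail fast if it is missing.
# MISTRAL_API_KEY is an assumed name; use whatever your deployment provides.
api_key = os.environ.get("MISTRAL_API_KEY")
if not api_key:
    raise RuntimeError("Set MISTRAL_API_KEY before running the agent.")

client = Mistral(api_key=api_key)
```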
Deliverables
Implement a minimal Python tool that:
- Ingests a small local corpus (Markdown/PDF), chunks text, generates embeddings, and builds an index (see the sketch after this list).
- Retrieves top-k passages for a user query.
- Composes a prompt and invokes the Mistral chat completions endpoint with streaming.
- Keeps short-term conversation memory.
- Reads the API token from an environment variable.
- Handles errors and rate limits, with retries.
- Includes a brief README and smoke tests.
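A minimal sketch of the pipeline behind the first four items, continuing from the setup above. It assumes the `mistralai` 1.x SDK (`client.embeddings.create`, `client.chat.stream`), the `mistral-embed` embedding model, and a brute-force NumPy cosine-similarity index, which is adequate for a small local corpus; model names, chunk sizes, and the memory window are illustrative choices, not requirements:

```python
import numpy as np

EMBED_MODEL = "mistral-embed"        # Mistral's embedding model
CHAT_MODEL = "mistral-small-latest"  # assumed chat model; pick per latency/cost budget

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap: the simplest workable strategy."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into a (len(texts), dim) matrix."""
    resp = client.embeddings.create(model=EMBED_MODEL, inputs=texts)
    return np.array([item.embedding for item in resp.data])

def top_k(query: str, chunks: list[str], index: np.ndarray, k: int = 4) -> list[str]:
    """Brute-force cosine similarity over the chunk embeddings."""
    q = embed([query])[0]
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def answer(query: str, chunks: list[str], index: np.ndarray, history: list[dict]) -> str:
    """Compose a grounded prompt, stream the completion, and update memory."""
    context = "\n---\n".join(top_k(query, chunks, index))
    messages = (
        [{"role": "system", "content": f"Answer using only this context:\n{context}"}]
        + history[-6:]  # short-term memory: the last three exchanges
        + [{"role": "user", "content": query}]
    )
    parts = []
    for event in client.chat.stream(model=CHAT_MODEL, messages=messages):
        delta = event.data.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # stream tokens as they arrive
            parts.append(delta)
    history += [{"role": "user", "content": query},
                {"role": "assistant", "content": "".join(parts)}]
    return "".join(parts)
```

Ingestion then reduces to reading each file, calling `chunk_text`, and stacking `embed(chunks)` into the index matrix; persisting that matrix to disk keeps re-runs cheap.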
Then, explain your system design choices, covering:
- Chunking strategy
- Embedding model choice
- Vector index selection
- Latency and cost budgeting
- Caching
- Prompt templating (sketched below, together with retrieval fallbacks)
- Safety/PII handling
- Logging/monitoring
- Offline RAG evaluation
- Fallbacks when retrieval fails
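As one illustration of how the last two points interact, a templating sketch that degrades gracefully when retrieval comes back weak (the threshold and refusal behavior are illustrative and would be tuned against an offline evaluation set):

```python
PROMPT_TEMPLATE = """Answer the question using only the numbered context passages
below, citing them like [1]. If the context does not contain the answer, say so
plainly instead of guessing.

Context:
{context}

Question: {question}"""

MIN_SIMILARITY = 0.3  # illustrative cutoff; calibrate on held-out queries

def build_prompt(question: str, passages: list[tuple[str, float]]) -> str | None:
    """Format retrieved (text, score) pairs, or return None to trigger a fallback."""
    kept = [(text, score) for text, score in passages if score >= MIN_SIMILARITY]
    if not kept:
        return None  # caller answers "I don't know" or retries with a rewritten query
    context = "\n".join(f"[{i + 1}] {text}" for i, (text, _) in enumerate(kept))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```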