This question evaluates competency in end-to-end LLM system design for improving multi-step reasoning, including task specification, data construction, prompt and inference strategies, fine-tuning/post-training choices, retrieval and tool integration, evaluation, and operational trade-offs.
You are building an LLM-powered product for a domain-specific task that requires multi-step reasoning. The base model performs reasonably well on easy examples but often fails on harder cases that require decomposition, intermediate verification, or tool use.
Design an end-to-end plan to improve the model's performance on this reasoning-heavy task. Your answer should cover: