Troubleshooting Large Logs Over SSH and Designing Centralized Logging
Context
You are on-call for a production service that is failing. You have SSH access to a Linux host, but the application log files are very large (and may be rotated or compressed). You need to quickly locate the relevant errors and narrow down the problematic time window. Since the problem may recur, you should also outline a centralized logging/search solution and discuss its trade-offs.
Assume:
- You can use common CLI tools available on most Linux hosts (e.g., journalctl, grep/awk/sed, less, zgrep, lsof).
- Logs may be written to systemd-journald or to files under /var/log or an app-specific directory, with rotation (e.g., .gz files); one way to confirm where a given process actually writes is sketched after this list.
- Network bandwidth is limited; avoid transferring large files off the host.
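
As a quick preliminary check, the sketch below shows one way to confirm where a running process actually writes its logs and whether the journal holds anything for it. The process name "myapp" and unit "myapp.service" are placeholders, not part of the scenario.

```sh
# Placeholders: "myapp" (process name), "myapp.service" (systemd unit).
# List regular files the process holds open for writing -- usually its active log files.
lsof -p "$(pgrep -o -x myapp)" | awk '$4 ~ /w/ && $5 == "REG"'

# Check whether the unit also logs to systemd-journald, and how much journal history is kept.
journalctl -u myapp.service -n 5 --no-pager
journalctl --disk-usage
```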
Tasks
- On-host triage: Describe how you would efficiently find the relevant errors and time ranges in very large logs. Specify concrete commands, filters, and strategies, including how you handle rotated/compressed logs, multiline stack traces, and time filtering; a representative command sketch follows this list.
- If the issue recurs: Propose a centralized logging and search architecture covering ingestion, processing, storage, and query/visualization. Discuss the trade-offs among common choices (e.g., Elasticsearch/OpenSearch, Loki, ClickHouse, object storage + query engines, managed services) in terms of cost, scale, performance, and operability; a minimal ingestion/query sketch for one of these options also follows this list.
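
For the on-host triage task, the following is a minimal command sketch, assuming a hypothetical service that logs to /var/log/myapp/app.log with ISO-8601 timestamps at the start of each line and standard .gz rotation; the names, patterns, paths, and time bounds would need to match the real service.

```sh
# Time-bounded journal search at error priority and above ("myapp.service" is a placeholder).
journalctl -u myapp.service --since "2024-05-01 12:00" --until "2024-05-01 13:00" -p err

# Search the current and rotated/compressed logs in place; zgrep reads .gz files directly,
# so nothing has to be decompressed to disk or copied off the host.
zgrep -hE 'ERROR|FATAL|Exception' /var/log/myapp/app.log /var/log/myapp/app.log.*.gz | less

# Keep context lines around each hit so multiline stack traces stay attached to the match.
grep -n -B2 -A25 'OutOfMemoryError' /var/log/myapp/app.log | less

# Count ERROR lines per minute to spot when the failure started
# (assumes lines begin with an ISO-8601 "YYYY-MM-DDTHH:MM:SS" timestamp).
awk -F'[T:]' '/ERROR/ {print $1 "T" $2 ":" $3}' /var/log/myapp/app.log | sort | uniq -c
```

less itself is also useful interactively on very large files, since it does not read the whole file into memory and supports searching in place.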
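
For the recurrence task, if the Elasticsearch/OpenSearch option is chosen, ingestion and time-scoped querying can be illustrated with the standard REST APIs. This is a minimal sketch assuming a hypothetical, unauthenticated single-node cluster at localhost:9200; the index name, field names, and example values are illustrative only.

```sh
# Ingest one structured log event into a (hypothetical) daily index.
curl -s -X POST 'http://localhost:9200/app-logs-2024.05.01/_doc' \
  -H 'Content-Type: application/json' \
  -d '{"@timestamp":"2024-05-01T12:34:56Z","level":"ERROR","service":"checkout","message":"upstream timeout"}'

# Query all app-logs-* indices for ERROR events inside a one-hour window.
curl -s -X POST 'http://localhost:9200/app-logs-*/_search' \
  -H 'Content-Type: application/json' \
  -d '{
        "query": {
          "bool": {
            "must":   [ { "match": { "level": "ERROR" } } ],
            "filter": [ { "range": { "@timestamp":
              { "gte": "2024-05-01T12:00:00Z", "lte": "2024-05-01T13:00:00Z" } } } ]
          }
        }
      }'
```

In a real deployment, a shipper such as Filebeat, Fluent Bit, or Vector would batch events through the _bulk API rather than per-document POSTs; the point of the sketch is only that structure applied at ingest time is what makes later field- and time-scoped queries cheap.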