Troubleshooting Large Logs Over SSH and Designing Centralized Logging
Context
You are on-call for a production service that is failing. You have SSH access to a Linux host, but the application log files are very large (and may be rotated or compressed). You need to quickly locate the relevant errors and narrow down the problematic time window. Since the problem may recur, you should also outline a centralized logging/search solution and discuss its trade-offs.
Assume:
- You can use common CLI tools available on most Linux hosts (e.g., journalctl, grep/awk/sed, less, zgrep, lsof).
- Logs may be written to systemd-journald or to files under /var/log or an app-specific directory, with rotation (e.g., .gz files); one way to confirm where a given process actually writes is sketched after this list.
- Network bandwidth is limited; avoid transferring large files off the host.
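
As a quick preliminary check, the sketch below shows one way to confirm where a running process actually writes its logs and whether the journal holds anything for it. The process name "myapp" and unit "myapp.service" are placeholders, not part of the scenario.

```sh
# Placeholders: "myapp" (process name), "myapp.service" (systemd unit).
# List regular files the process holds open for writing -- usually its active log files.
lsof -p "$(pgrep -o -x myapp)" | awk '$4 ~ /w/ && $5 == "REG"'

# Check whether the unit also logs to systemd-journald, and how much journal history is kept.
journalctl -u myapp.service -n 5 --no-pager
journalctl --disk-usage
```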
Tasks
- On-host triage: Describe how you would efficiently find the relevant errors and time ranges in very large logs. Specify concrete commands, filters, and strategies, including how you handle rotated/compressed logs, multiline stack traces, and time filtering; a representative command sketch follows this list.
- If the issue recurs: Propose a centralized logging and search architecture covering ingestion, processing, storage, and query/visualization. Discuss the trade-offs among common choices (e.g., Elasticsearch/OpenSearch, Loki, ClickHouse, object storage + query engines, managed services) in terms of cost, scale, performance, and operability; a minimal ingestion/query sketch for one of these options also follows this list.
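
For the on-host triage task, the following is a minimal command sketch, assuming a hypothetical service that logs to /var/log/myapp/app.log with ISO-8601 timestamps at the start of each line and standard .gz rotation; the names, patterns, paths, and time bounds would need to match the real service.

```sh
# Time-bounded journal search at error priority and above ("myapp.service" is a placeholder).
journalctl -u myapp.service --since "2024-05-01 12:00" --until "2024-05-01 13:00" -p err

# Search the current and rotated/compressed logs in place; zgrep reads .gz files directly,
# so nothing has to be decompressed to disk or copied off the host.
zgrep -hE 'ERROR|FATAL|Exception' /var/log/myapp/app.log /var/log/myapp/app.log.*.gz | less

# Keep context lines around each hit so multiline stack traces stay attached to the match.
grep -n -B2 -A25 'OutOfMemoryError' /var/log/myapp/app.log | less

# Count ERROR lines per minute to spot when the failure started
# (assumes lines begin with an ISO-8601 "YYYY-MM-DDTHH:MM:SS" timestamp).
awk -F'[T:]' '/ERROR/ {print $1 "T" $2 ":" $3}' /var/log/myapp/app.log | sort | uniq -c
```

less itself is also useful interactively on very large files, since it does not read the whole file into memory and supports searching in place.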
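
For the recurrence task, if the Elasticsearch/OpenSearch option is chosen, ingestion and time-scoped querying can be illustrated with the standard REST APIs. This is a minimal sketch assuming a hypothetical, unauthenticated single-node cluster at localhost:9200; the index name, field names, and example values are illustrative only.

```sh
# Ingest one structured log event into a (hypothetical) daily index.
curl -s -X POST 'http://localhost:9200/app-logs-2024.05.01/_doc' \
  -H 'Content-Type: application/json' \
  -d '{"@timestamp":"2024-05-01T12:34:56Z","level":"ERROR","service":"checkout","message":"upstream timeout"}'

# Query all app-logs-* indices for ERROR events inside a one-hour window.
curl -s -X POST 'http://localhost:9200/app-logs-*/_search' \
  -H 'Content-Type: application/json' \
  -d '{
        "query": {
          "bool": {
            "must":   [ { "match": { "level": "ERROR" } } ],
            "filter": [ { "range": { "@timestamp":
              { "gte": "2024-05-01T12:00:00Z", "lte": "2024-05-01T13:00:00Z" } } } ]
          }
        }
      }'
```

In a real deployment, a shipper such as Filebeat, Fluent Bit, or Vector would batch events through the _bulk API rather than per-document POSTs; the point of the sketch is only that structure applied at ingest time is what makes later field- and time-scoped queries cheap.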