Find Valid IP Addresses in Files
Company: Amazon
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
You are given an absolute or relative path to a directory on the local file system.
Write a program that recursively traverses every subdirectory under that root directory, reads every regular file, finds all valid IPv4 addresses contained in those files, and prints the addresses in lexicographic order.
A valid IPv4 address has exactly four decimal octets separated by dots. Each octet must be an integer from `0` to `255`. Leading zeros are not allowed unless the octet is exactly `0`. For example, `192.168.0.1` and `0.0.0.0` are valid, while `256.1.1.1`, `01.2.3.4`, `1.2.3`, and `1.2.3.4.5` are invalid.
Clarifications:
- Recursively visit all nested folders under the input path.
- Only scan regular files.
- Treat file contents as text; if a file cannot be decoded or read, handle it gracefully according to your language's best practices.
- If the same valid IP address appears multiple times, print it only once.
- The final output should contain one IP address per line, sorted lexicographically as strings.
Quick Answer: This question evaluates file system traversal, recursive I/O handling, text parsing and pattern recognition, input validation for IPv4 formatting, deduplication, and ordered output; it is in the Coding & Algorithms category and emphasizes practical application over purely conceptual understanding.
This problem is based on scanning a real directory tree. To make it deterministic for an online judge, the file system is provided as a nested dictionary named `filesystem`: a value that is a `dict` is a subdirectory, a value that is a `str` is a text file, and a value that is `bytes` or `bytearray` is a file that should be decoded as UTF-8 and skipped if decoding fails. Traverse every subdirectory, inspect every regular file, find all unique valid IPv4 addresses in the file contents, and return them sorted lexicographically as strings. A valid IPv4 address has exactly four decimal octets separated by dots; each octet must be between `0` and `255`, and leading zeros are not allowed unless the octet is exactly `0`.
Constraints
- 0 <= total number of files and folders <= 10^5
- The total length of all decodable file contents is at most 10^6 characters
- Only leaf values of type `str`, `bytes`, or `bytearray` should be treated as regular files
Examples
Input: ({'readme.txt': 'Main IPs: 10.0.0.1 and 192.168.0.1. Invalid: 256.1.1.1, 192.168.001.1.', 'logs': {'today.log': b'Seen 10.0.0.1 twice and 0.0.0.0 once', 'bad.bin': b'\xff\xfe\xfa'}},)
Expected Output: ['0.0.0.0', '10.0.0.1', '192.168.0.1']