##### Question
Implement a utility `tail(path: string, N: int)` (and an equivalent stream variant) that prints the **last N lines** of input without loading the whole input into memory. Assume variable-length lines, an unknown total line count, and a file that may be many gigabytes.
Your solution must address both input modes and the full set of correctness constraints below:
1. **Seekable / random-access file.** Use either a backward scan (seek to the end and read fixed-size blocks toward the front, counting newlines until you have collected N lines) or a fixed-size ring buffer over a forward read. Avoid re-reading the whole file.
2. **Non-seekable / unbounded stream (stdin or a pipe).** Seeking is impossible and the length is unknown, so maintain only **O(N)** memory — a ring buffer of the last N lines — while reading forward exactly once.
3. **Memory bound.** Never read the entire file into memory; memory must be bounded by N (and one line), not by file size.
4. **Newline conventions.** Correctly handle Unix `LF` (`\n`), Windows `CRLF` (`\r\n`), and classic Mac `CR` (`\r`) line endings, including a mix.
5. **UTF-8 correctness.** Split on line boundaries at the byte level but preserve UTF-8 multibyte characters — never cut a character in the middle when reading or buffering blocks.
6. **Missing trailing newline.** A final line not terminated by a newline must still count as a line and be emitted.
7. **Extremely long lines.** A single line may exceed your buffer/block size; handle lines larger than the read buffer without corruption.
8. State the **time and space complexity** of each approach and give the **key test cases** you would write.
**Follow-ups the interviewer will push on:**
- Extend to `tail -f`: follow the file as it grows, emitting new lines as they are appended.
- Handle **log rotation / truncation** and **symlinks** (detect when the file is replaced or truncated and re-open / reset position).
- Support the `-c N` option: print the last N **bytes** instead of the last N lines.
Quick Answer: Confluent software-engineer technical-screen coding question: implement tail(path, N) to print the last N lines of a possibly multi-GB file or an unbounded stream without loading it into memory. It tests a backward block scan for seekable files versus an O(N) ring buffer for non-seekable streams, plus correct handling of \n/\r\n/\r newlines, UTF-8 multibyte characters, missing trailing newlines, and extremely long lines, with follow-ups on tail -f, log rotation/truncation, and the -c last-N-bytes option.
Solution
The whole problem turns on one branch: **can you seek?** A regular file is seekable, so you can jump to the end and walk *backward*, paying only for the bytes spanned by the last $N$ lines. A pipe / stdin is not seekable and its length is unknown, so you must read *forward* exactly once and keep only the last $N$ lines in an $O(N)$ ring buffer. Both keep memory bounded by $N$ (plus one line), never by file size.
The hard part is not the high-level strategy — it is getting the byte-level details right: mixed `\n` / `\r\n` / `\r`, a `\r\n` straddling a block boundary, multibyte UTF‑8 never cut mid-character, an unterminated final line still counting, and a single line larger than the read buffer.
### Design decision: work in bytes, decode last
I detect line boundaries on the **raw byte stream**, then UTF‑8-decode each finished line. This is the key invariant that makes UTF‑8 safety free: a newline (`0x0A`) or carriage return (`0x0D`) is a single ASCII byte that can never appear *inside* a UTF‑8 multibyte sequence (continuation bytes are all `0x80–0xBF`). So splitting on `\n` / `\r` at the byte level can never land in the middle of a character — I only have to make sure each emitted line is decoded as a complete byte span.
Both modes share one splitter. `_scan(buf)` returns the complete lines closed by a terminator inside `buf`, plus the trailing fragment after the last terminator (the start of the next, not-yet-closed line):
```python
# Split a byte span on \n, \r\n, and lone \r (each counts as ONE terminator).
# Returns (complete_lines, trailing_fragment). trailing is the bytes after the
# final terminator, or b'' if the span ends exactly on one.
def _scan(buf):
lines = []
i, start, n = 0, 0, len(buf)
while i < n:
b = buf[i]
if b == 0x0A: # \n
lines.append(buf[start:i]); i += 1; start = i
elif b == 0x0D: # \r, maybe \r\n
lines.append(buf[start:i])
i += 2 if (i + 1 < n and buf[i + 1] == 0x0A) else 1
start = i
else:
i += 1
return lines, buf[start:]
```
---
### 1. Seekable file — backward block scan, $O(N)$ memory
Seek to EOF and read fixed-size blocks (64 KiB) moving toward the front, **prepending** each block into a small `tail` buffer. I only need the span that covers the last $N$ lines, so I keep reading earlier blocks until `tail` holds more than $N$ terminators — at which point the last $N$ lines are guaranteed fully present — or I reach the start of the file. Then I split **once** and take the last $N$ lines.
The point the naive version gets wrong is *when* to split. Splitting block-by-block while scanning backward is where carry bugs creep in (a line fragment from one block belongs to the line continuing in the *next, earlier* block). Assembling the raw byte span first and splitting once it provably covers $N{+}1$ lines sidesteps the carry entirely, and it is still bounded: the loop stops the moment the terminator count exceeds $N$, so `tail` never grows past the bytes of the last $N{+}1$ lines plus the one block that pushed the count over — never to file size.
```python
import os
def _count_terms(buf):
c, i, n = 0, 0, len(buf)
while i < n:
b = buf[i]
if b == 0x0A:
c += 1; i += 1
elif b == 0x0D:
c += 1; i += 2 if (i + 1 < n and buf[i + 1] == 0x0A) else 1
else:
i += 1
return c
def tail_seekable(path, n, block=65536, encoding="utf-8"):
if n <= 0:
return []
with open(path, "rb") as f:
f.seek(0, os.SEEK_END)
pos = f.tell()
if pos == 0:
return []
tail = b""
while pos > 0:
read = min(block, pos)
pos -= read
f.seek(pos)
tail = f.read(read) + tail # prepend the earlier block
if _count_terms(tail) > n: # last N lines now fully spanned
break
lines, trailing = _scan(tail)
if trailing: # unterminated final line (req. 6)
lines.append(trailing)
return [b.decode(encoding) for b in lines[-n:]]
```
The unterminated-final-line case (requirement 6) needs no separate probe: `_scan` already returns the bytes after the last terminator as `trailing`, and I append it as a real line exactly when it is non-empty. A file that *does* end in a terminator yields `trailing == b""`, so no phantom empty line is added. The `\r\n`-straddle hazard also disappears: the split happens once over a contiguous span, so a `\r` and its `\n` are always adjacent in `tail` and counted as one terminator.
- **Time:** $O(B)$ where $B$ is the number of bytes spanned by the last $N$ lines (we stop as soon as `tail` covers them). For a multi-GB log this is the difference between milliseconds and a full scan.
- **Space:** $O(N \cdot L + \text{block})$ where $L$ is max line length — `tail` is at most the last $N{+}1$ lines plus the one block that tipped the terminator count over. **Not** $O(\text{file size})$, which is the trap in the "ring-buffer the whole forward read" version on a huge seekable file.
**Why backward and not a forward ring buffer here?** A forward `deque(maxlen=N)` over the *whole* file is correct and simpler, but it is $O(\text{file size})$ time because it reads every byte. Backward scanning is the whole point of `tail` on a huge seekable file: touch only the tail.
---
### 2. Non-seekable stream — forward single pass, $O(N)$ memory
You cannot seek and you do not know the length, so a backward scan is impossible. Read forward **exactly once**, maintaining a `deque(maxlen=N)` of the last $N$ complete lines. Pushing the $(N{+}1)$-th line evicts the oldest in $O(1)$.
The only real work is incremental line splitting across reads: a line may span many reads (requirement 7), and a `\r\n` may straddle a read boundary (requirement 4). The straddle is the subtle one — when a read ends in a lone `\r`, I cannot yet tell whether the next read begins with `\n` (one `\r\n`) or with other bytes (a lone-`\r` line). So I hold the trailing `\r` back and let the next read decide. This guard must fire **whenever a chunk ends in `\r`**, which is exactly what the `buf.endswith(b"\r")` check below does — independent of chunk size.
```python
def _iter_lines_bytes(stream, chunk=65536):
"""Yield complete lines (bytes, terminator stripped) from a binary stream,
then the final unterminated line if present. Handles \n, \r\n, \r, a \r\n
split across reads, and lines larger than `chunk`."""
buf = b"" # bytes of the line currently being built
pending_cr = False # a \r closed `buf`; awaiting the \r\n test
while True:
data = stream.read(chunk)
if not data:
break
if pending_cr:
yield buf # the \r-terminated line
buf = b""
pending_cr = False
if data[:1] == b"\n": # the \r\n was split across this boundary
data = data[1:]
buf += data
if buf.endswith(b"\r"): # defer: could be \r or the \r of a \r\n
head = buf[:-1]
lines, trailing = _scan(head)
for ln in lines:
yield ln
buf = trailing # bytes between last terminator and the \r
pending_cr = True
else:
lines, trailing = _scan(buf)
for ln in lines:
yield ln
buf = trailing
if pending_cr: # stream ended on a lone \r
yield buf
elif buf: # final unterminated line (requirement 6)
yield buf
def tail_stream(stream, n, encoding="utf-8"):
if n <= 0:
return []
ring = deque(maxlen=n)
for line in _iter_lines_bytes(stream):
ring.append(line.decode(encoding))
return list(ring)
```
- **Time:** $O(\text{total bytes})$ — unavoidable; you must consume the whole stream to know which lines are last.
- **Space:** $O(N \cdot L)$ for the ring, plus $O(L)$ for the in-flight line / read chunk. Bounded by $N$ and one line, never by stream length.
---
### Cross-cutting correctness (mapping to each requirement)
- **(3) Memory bound.** Both paths cap live memory at the $N$-line buffer plus one block/line. Neither materializes the file: the backward path stops reading once it has spanned $N$ lines; the forward path keeps only a `deque(maxlen=N)`.
- **(4) Newline conventions.** `_scan` treats `\n`, lone `\r`, and `\r\n` as one terminator each. The straddle hazards are handled in the two places they can arise: backward, by splitting a single contiguous span so a `\r` and `\n` are always adjacent; forward, by deferring a trailing `\r` via `pending_cr` until the next read resolves whether a `\n` follows.
- **(5) UTF‑8 correctness.** I split on `\n`/`\r` *bytes*, which can never fall inside a multibyte sequence, then decode each complete line. I never decode a half-line in isolation — the backward path assembles the full byte span before any `decode`, and the forward path only decodes a line once its terminator (or EOF) has been seen. If you instead wanted to stream-decode, `codecs.getincrementaldecoder("utf-8")` buffers a partial trailing multibyte across reads — but byte-level line splitting makes that unnecessary here.
- **(6) Missing trailing newline.** Backward: `_scan` returns the post-last-terminator bytes as `trailing`, appended as a line iff non-empty (so a terminated file adds no phantom empty line). Forward: the final non-empty `buf` is yielded.
- **(7) Extremely long lines.** Nothing assumes one read == one line. Forward accumulates into `buf` until a terminator appears; backward keeps prepending earlier blocks until the terminator count exceeds $N$, so a giant line just makes `tail` (and $B$) large for that one line. Memory for such a line is $O(L)$, unavoidable since the line must be emitted whole.
---
### Follow-ups
**`tail -f` (follow growth).** After emitting the last $N$ lines, record the current offset. Then loop: prefer `inotify`/`kqueue` (event-driven, no busy polling); fall back to a short `sleep` + `stat`. On each wake, if `size > offset`, read `[offset, size)`, split into lines, print, advance `offset`. Carry an incomplete trailing fragment between iterations (the same `buf`/`pending_cr` state as the streaming reader) so a line written in two `write()`s is not printed twice or split.
**Log rotation / truncation / symlinks.** Each poll, `stat` the path and compare `(st_dev, st_ino)` against the open handle's:
- *Rotated* (inode/device changed — a new file took the name): finish reading the old handle to EOF, then `open()` the path afresh and reset `offset = 0`. This is GNU `tail -F` behavior.
- *Truncated* (same inode but `st_size < offset`): reset `offset = 0` and continue from the new start.
- *Symlinks:* decide policy explicitly. To follow the **target**, resolve the link once at open time. To follow whatever the **name** points to (rotation via symlink swap), re-resolve the path each poll and detect the inode change as above.
**`-c N` (last N bytes).** The easy case, and crucially **not** UTF‑8-aware — `-c` is defined in raw bytes and may legitimately start mid-character (matching GNU `tail -c`).
- *Seekable:* `seek(max(0, size - N), SEEK_SET)`; dump to EOF. $O(N)$ time, $O(1)$ seek.
- *Stream:* keep a byte ring (`deque(maxlen=N)` over single bytes, or a circular `bytearray` of length $N$ with a write cursor — the latter is far more efficient) and emit it at EOF. $O(\text{total bytes})$ time, $O(N)$ space.
---
### Complexity summary
| Mode | Time | Extra space |
|------|------|-------------|
| Seekable, last $N$ lines (backward) | $O(B)$, $B$ = bytes in last $N$ lines | $O(N \cdot L + \text{block})$ |
| Seekable via forward ring (simpler, slower) | $O(\text{file size})$ | $O(N \cdot L)$ |
| Non-seekable stream, last $N$ lines | $O(\text{total bytes})$ | $O(N \cdot L)$ |
| `-c N`, seekable | $O(N)$ | $O(1)$ |
| `-c N`, stream | $O(\text{total bytes})$ | $O(N)$ |
$L$ = max line length.
---
### Key test cases
- **Boundary $N$:** `N = 0` (empty output); file with fewer than $N$ lines (return all); exactly $N$ lines.
- **Empty file** (return `[]`, no crash).
- **No trailing newline** — final unterminated line must appear; a *terminated* file must not gain a phantom empty line.
- **Terminators:** pure `\n`; pure `\r\n`; pure lone `\r`; a mix in one file.
- **`\r\n` straddling a block boundary** (backward) and **a read boundary** (forward, including a chunk that ends in exactly one `\r`) — must count as one line, not two.
- **Line longer than the block/read size** — emitted intact.
- **Multibyte UTF‑8** lines whose bytes straddle a block boundary — characters never corrupted.
- **`tail -f`:** append while reading; then rotate (new inode) and truncate; verify re-open / offset reset.
- **`-c N`** where $N >$ file size (clamp to whole file) and where the cut lands mid-character (byte-exact, not character-aligned).
Explanation
Two regimes: a seekable file (read fixed-size blocks backward from EOF, counting newlines until N are found — O(bytes-in-last-N-lines), not O(file-size)) and a non-seekable stream (a forward single pass into an O(N) ring buffer such as deque(maxlen=N), since seeking and known length are unavailable). Cross-cutting correctness the interviewer probes: handle \n / \r\n / \r including a \r\n straddling a block boundary; slice bytes and decode UTF-8 only over complete spans (or use an incremental decoder) so multibyte characters are never cut; count an unterminated final line as a line; and handle lines larger than the read buffer. Follow-ups extend to tail -f (track offset + inode/size to detect rotation/truncation, re-resolve symlinks) and -c N for the last N raw bytes (trivial seek for files, byte ring buffer for streams, no UTF-8 boundary alignment).