PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Coding & Algorithms/Confluent

Implement tail N lines

Last updated: Jun 15, 2026

Quick Overview

Confluent software-engineer technical-screen coding question: implement tail(path, N) to print the last N lines of a possibly multi-GB file or an unbounded stream without loading it into memory. It tests a backward block scan for seekable files versus an O(N) ring buffer for non-seekable streams, plus correct handling of \n/\r\n/\r newlines, UTF-8 multibyte characters, missing trailing newlines, and extremely long lines, with follow-ups on tail -f, log rotation/truncation, and the -c last-N-bytes option.

  • Medium
  • Confluent
  • Coding & Algorithms
  • Software Engineer

Implement tail N lines

Company: Confluent

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

##### Question Implement a utility `tail(path: string, N: int)` (and an equivalent stream variant) that prints the **last N lines** of input without loading the whole input into memory. Assume variable-length lines, an unknown total line count, and a file that may be many gigabytes. Your solution must address both input modes and the full set of correctness constraints below: 1. **Seekable / random-access file.** Use either a backward scan (seek to the end and read fixed-size blocks toward the front, counting newlines until you have collected N lines) or a fixed-size ring buffer over a forward read. Avoid re-reading the whole file. 2. **Non-seekable / unbounded stream (stdin or a pipe).** Seeking is impossible and the length is unknown, so maintain only **O(N)** memory — a ring buffer of the last N lines — while reading forward exactly once. 3. **Memory bound.** Never read the entire file into memory; memory must be bounded by N (and one line), not by file size. 4. **Newline conventions.** Correctly handle Unix `LF` (`\n`), Windows `CRLF` (`\r\n`), and classic Mac `CR` (`\r`) line endings, including a mix. 5. **UTF-8 correctness.** Split on line boundaries at the byte level but preserve UTF-8 multibyte characters — never cut a character in the middle when reading or buffering blocks. 6. **Missing trailing newline.** A final line not terminated by a newline must still count as a line and be emitted. 7. **Extremely long lines.** A single line may exceed your buffer/block size; handle lines larger than the read buffer without corruption. 8. State the **time and space complexity** of each approach and give the **key test cases** you would write. **Follow-ups the interviewer will push on:** - Extend to `tail -f`: follow the file as it grows, emitting new lines as they are appended. - Handle **log rotation / truncation** and **symlinks** (detect when the file is replaced or truncated and re-open / reset position). - Support the `-c N` option: print the last N **bytes** instead of the last N lines.

Quick Answer: Confluent software-engineer technical-screen coding question: implement tail(path, N) to print the last N lines of a possibly multi-GB file or an unbounded stream without loading it into memory. It tests a backward block scan for seekable files versus an O(N) ring buffer for non-seekable streams, plus correct handling of \n/\r\n/\r newlines, UTF-8 multibyte characters, missing trailing newlines, and extremely long lines, with follow-ups on tail -f, log rotation/truncation, and the -c last-N-bytes option.

Solution

The whole problem turns on one branch: **can you seek?** A regular file is seekable, so you can jump to the end and walk *backward*, paying only for the bytes spanned by the last $N$ lines. A pipe / stdin is not seekable and its length is unknown, so you must read *forward* exactly once and keep only the last $N$ lines in an $O(N)$ ring buffer. Both keep memory bounded by $N$ (plus one line), never by file size. The hard part is not the high-level strategy — it is getting the byte-level details right: mixed `\n` / `\r\n` / `\r`, a `\r\n` straddling a block boundary, multibyte UTF‑8 never cut mid-character, an unterminated final line still counting, and a single line larger than the read buffer. ### Design decision: work in bytes, decode last I detect line boundaries on the **raw byte stream**, then UTF‑8-decode each finished line. This is the key invariant that makes UTF‑8 safety free: a newline (`0x0A`) or carriage return (`0x0D`) is a single ASCII byte that can never appear *inside* a UTF‑8 multibyte sequence (continuation bytes are all `0x80–0xBF`). So splitting on `\n` / `\r` at the byte level can never land in the middle of a character — I only have to make sure each emitted line is decoded as a complete byte span. Both modes share one splitter. `_scan(buf)` returns the complete lines closed by a terminator inside `buf`, plus the trailing fragment after the last terminator (the start of the next, not-yet-closed line): ```python # Split a byte span on \n, \r\n, and lone \r (each counts as ONE terminator). # Returns (complete_lines, trailing_fragment). trailing is the bytes after the # final terminator, or b'' if the span ends exactly on one. def _scan(buf): lines = [] i, start, n = 0, 0, len(buf) while i < n: b = buf[i] if b == 0x0A: # \n lines.append(buf[start:i]); i += 1; start = i elif b == 0x0D: # \r, maybe \r\n lines.append(buf[start:i]) i += 2 if (i + 1 < n and buf[i + 1] == 0x0A) else 1 start = i else: i += 1 return lines, buf[start:] ``` --- ### 1. Seekable file — backward block scan, $O(N)$ memory Seek to EOF and read fixed-size blocks (64 KiB) moving toward the front, **prepending** each block into a small `tail` buffer. I only need the span that covers the last $N$ lines, so I keep reading earlier blocks until `tail` holds more than $N$ terminators — at which point the last $N$ lines are guaranteed fully present — or I reach the start of the file. Then I split **once** and take the last $N$ lines. The point the naive version gets wrong is *when* to split. Splitting block-by-block while scanning backward is where carry bugs creep in (a line fragment from one block belongs to the line continuing in the *next, earlier* block). Assembling the raw byte span first and splitting once it provably covers $N{+}1$ lines sidesteps the carry entirely, and it is still bounded: the loop stops the moment the terminator count exceeds $N$, so `tail` never grows past the bytes of the last $N{+}1$ lines plus the one block that pushed the count over — never to file size. ```python import os def _count_terms(buf): c, i, n = 0, 0, len(buf) while i < n: b = buf[i] if b == 0x0A: c += 1; i += 1 elif b == 0x0D: c += 1; i += 2 if (i + 1 < n and buf[i + 1] == 0x0A) else 1 else: i += 1 return c def tail_seekable(path, n, block=65536, encoding="utf-8"): if n <= 0: return [] with open(path, "rb") as f: f.seek(0, os.SEEK_END) pos = f.tell() if pos == 0: return [] tail = b"" while pos > 0: read = min(block, pos) pos -= read f.seek(pos) tail = f.read(read) + tail # prepend the earlier block if _count_terms(tail) > n: # last N lines now fully spanned break lines, trailing = _scan(tail) if trailing: # unterminated final line (req. 6) lines.append(trailing) return [b.decode(encoding) for b in lines[-n:]] ``` The unterminated-final-line case (requirement 6) needs no separate probe: `_scan` already returns the bytes after the last terminator as `trailing`, and I append it as a real line exactly when it is non-empty. A file that *does* end in a terminator yields `trailing == b""`, so no phantom empty line is added. The `\r\n`-straddle hazard also disappears: the split happens once over a contiguous span, so a `\r` and its `\n` are always adjacent in `tail` and counted as one terminator. - **Time:** $O(B)$ where $B$ is the number of bytes spanned by the last $N$ lines (we stop as soon as `tail` covers them). For a multi-GB log this is the difference between milliseconds and a full scan. - **Space:** $O(N \cdot L + \text{block})$ where $L$ is max line length — `tail` is at most the last $N{+}1$ lines plus the one block that tipped the terminator count over. **Not** $O(\text{file size})$, which is the trap in the "ring-buffer the whole forward read" version on a huge seekable file. **Why backward and not a forward ring buffer here?** A forward `deque(maxlen=N)` over the *whole* file is correct and simpler, but it is $O(\text{file size})$ time because it reads every byte. Backward scanning is the whole point of `tail` on a huge seekable file: touch only the tail. --- ### 2. Non-seekable stream — forward single pass, $O(N)$ memory You cannot seek and you do not know the length, so a backward scan is impossible. Read forward **exactly once**, maintaining a `deque(maxlen=N)` of the last $N$ complete lines. Pushing the $(N{+}1)$-th line evicts the oldest in $O(1)$. The only real work is incremental line splitting across reads: a line may span many reads (requirement 7), and a `\r\n` may straddle a read boundary (requirement 4). The straddle is the subtle one — when a read ends in a lone `\r`, I cannot yet tell whether the next read begins with `\n` (one `\r\n`) or with other bytes (a lone-`\r` line). So I hold the trailing `\r` back and let the next read decide. This guard must fire **whenever a chunk ends in `\r`**, which is exactly what the `buf.endswith(b"\r")` check below does — independent of chunk size. ```python def _iter_lines_bytes(stream, chunk=65536): """Yield complete lines (bytes, terminator stripped) from a binary stream, then the final unterminated line if present. Handles \n, \r\n, \r, a \r\n split across reads, and lines larger than `chunk`.""" buf = b"" # bytes of the line currently being built pending_cr = False # a \r closed `buf`; awaiting the \r\n test while True: data = stream.read(chunk) if not data: break if pending_cr: yield buf # the \r-terminated line buf = b"" pending_cr = False if data[:1] == b"\n": # the \r\n was split across this boundary data = data[1:] buf += data if buf.endswith(b"\r"): # defer: could be \r or the \r of a \r\n head = buf[:-1] lines, trailing = _scan(head) for ln in lines: yield ln buf = trailing # bytes between last terminator and the \r pending_cr = True else: lines, trailing = _scan(buf) for ln in lines: yield ln buf = trailing if pending_cr: # stream ended on a lone \r yield buf elif buf: # final unterminated line (requirement 6) yield buf def tail_stream(stream, n, encoding="utf-8"): if n <= 0: return [] ring = deque(maxlen=n) for line in _iter_lines_bytes(stream): ring.append(line.decode(encoding)) return list(ring) ``` - **Time:** $O(\text{total bytes})$ — unavoidable; you must consume the whole stream to know which lines are last. - **Space:** $O(N \cdot L)$ for the ring, plus $O(L)$ for the in-flight line / read chunk. Bounded by $N$ and one line, never by stream length. --- ### Cross-cutting correctness (mapping to each requirement) - **(3) Memory bound.** Both paths cap live memory at the $N$-line buffer plus one block/line. Neither materializes the file: the backward path stops reading once it has spanned $N$ lines; the forward path keeps only a `deque(maxlen=N)`. - **(4) Newline conventions.** `_scan` treats `\n`, lone `\r`, and `\r\n` as one terminator each. The straddle hazards are handled in the two places they can arise: backward, by splitting a single contiguous span so a `\r` and `\n` are always adjacent; forward, by deferring a trailing `\r` via `pending_cr` until the next read resolves whether a `\n` follows. - **(5) UTF‑8 correctness.** I split on `\n`/`\r` *bytes*, which can never fall inside a multibyte sequence, then decode each complete line. I never decode a half-line in isolation — the backward path assembles the full byte span before any `decode`, and the forward path only decodes a line once its terminator (or EOF) has been seen. If you instead wanted to stream-decode, `codecs.getincrementaldecoder("utf-8")` buffers a partial trailing multibyte across reads — but byte-level line splitting makes that unnecessary here. - **(6) Missing trailing newline.** Backward: `_scan` returns the post-last-terminator bytes as `trailing`, appended as a line iff non-empty (so a terminated file adds no phantom empty line). Forward: the final non-empty `buf` is yielded. - **(7) Extremely long lines.** Nothing assumes one read == one line. Forward accumulates into `buf` until a terminator appears; backward keeps prepending earlier blocks until the terminator count exceeds $N$, so a giant line just makes `tail` (and $B$) large for that one line. Memory for such a line is $O(L)$, unavoidable since the line must be emitted whole. --- ### Follow-ups **`tail -f` (follow growth).** After emitting the last $N$ lines, record the current offset. Then loop: prefer `inotify`/`kqueue` (event-driven, no busy polling); fall back to a short `sleep` + `stat`. On each wake, if `size > offset`, read `[offset, size)`, split into lines, print, advance `offset`. Carry an incomplete trailing fragment between iterations (the same `buf`/`pending_cr` state as the streaming reader) so a line written in two `write()`s is not printed twice or split. **Log rotation / truncation / symlinks.** Each poll, `stat` the path and compare `(st_dev, st_ino)` against the open handle's: - *Rotated* (inode/device changed — a new file took the name): finish reading the old handle to EOF, then `open()` the path afresh and reset `offset = 0`. This is GNU `tail -F` behavior. - *Truncated* (same inode but `st_size < offset`): reset `offset = 0` and continue from the new start. - *Symlinks:* decide policy explicitly. To follow the **target**, resolve the link once at open time. To follow whatever the **name** points to (rotation via symlink swap), re-resolve the path each poll and detect the inode change as above. **`-c N` (last N bytes).** The easy case, and crucially **not** UTF‑8-aware — `-c` is defined in raw bytes and may legitimately start mid-character (matching GNU `tail -c`). - *Seekable:* `seek(max(0, size - N), SEEK_SET)`; dump to EOF. $O(N)$ time, $O(1)$ seek. - *Stream:* keep a byte ring (`deque(maxlen=N)` over single bytes, or a circular `bytearray` of length $N$ with a write cursor — the latter is far more efficient) and emit it at EOF. $O(\text{total bytes})$ time, $O(N)$ space. --- ### Complexity summary | Mode | Time | Extra space | |------|------|-------------| | Seekable, last $N$ lines (backward) | $O(B)$, $B$ = bytes in last $N$ lines | $O(N \cdot L + \text{block})$ | | Seekable via forward ring (simpler, slower) | $O(\text{file size})$ | $O(N \cdot L)$ | | Non-seekable stream, last $N$ lines | $O(\text{total bytes})$ | $O(N \cdot L)$ | | `-c N`, seekable | $O(N)$ | $O(1)$ | | `-c N`, stream | $O(\text{total bytes})$ | $O(N)$ | $L$ = max line length. --- ### Key test cases - **Boundary $N$:** `N = 0` (empty output); file with fewer than $N$ lines (return all); exactly $N$ lines. - **Empty file** (return `[]`, no crash). - **No trailing newline** — final unterminated line must appear; a *terminated* file must not gain a phantom empty line. - **Terminators:** pure `\n`; pure `\r\n`; pure lone `\r`; a mix in one file. - **`\r\n` straddling a block boundary** (backward) and **a read boundary** (forward, including a chunk that ends in exactly one `\r`) — must count as one line, not two. - **Line longer than the block/read size** — emitted intact. - **Multibyte UTF‑8** lines whose bytes straddle a block boundary — characters never corrupted. - **`tail -f`:** append while reading; then rotate (new inode) and truncate; verify re-open / offset reset. - **`-c N`** where $N >$ file size (clamp to whole file) and where the cut lands mid-character (byte-exact, not character-aligned).

Explanation

Two regimes: a seekable file (read fixed-size blocks backward from EOF, counting newlines until N are found — O(bytes-in-last-N-lines), not O(file-size)) and a non-seekable stream (a forward single pass into an O(N) ring buffer such as deque(maxlen=N), since seeking and known length are unavailable). Cross-cutting correctness the interviewer probes: handle \n / \r\n / \r including a \r\n straddling a block boundary; slice bytes and decode UTF-8 only over complete spans (or use an incremental decoder) so multibyte characters are never cut; count an unterminated final line as a line; and handle lines larger than the read buffer. Follow-ups extend to tail -f (track offset + inode/size to detect rotation/truncation, re-resolve symlinks) and -c N for the last N raw bytes (trivial seek for files, byte ring buffer for streams, no UTF-8 boundary alignment).

Related Interview Questions

  • Implement Tail and Find Monster Cost - Confluent (medium)
  • Solve Signature, File, and Queue Problems - Confluent (medium)
  • Process pod logs with global increments and pop-min - Confluent (easy)
  • Implement File Tail and Sensor Health - Confluent (medium)
  • Rank songs by pairwise user preferences - Confluent (medium)
|Home/Coding & Algorithms/Confluent

Implement tail N lines

Confluent logo
Confluent
Sep 6, 2025, 12:00 AM
MediumSoftware EngineerTechnical ScreenCoding & Algorithms
32
0
Question

Implement a utility tail(path: string, N: int) (and an equivalent stream variant) that prints the last N lines of input without loading the whole input into memory. Assume variable-length lines, an unknown total line count, and a file that may be many gigabytes.

Your solution must address both input modes and the full set of correctness constraints below:

  1. Seekable / random-access file. Use either a backward scan (seek to the end and read fixed-size blocks toward the front, counting newlines until you have collected N lines) or a fixed-size ring buffer over a forward read. Avoid re-reading the whole file.
  2. Non-seekable / unbounded stream (stdin or a pipe). Seeking is impossible and the length is unknown, so maintain only O(N) memory — a ring buffer of the last N lines — while reading forward exactly once.
  3. Memory bound. Never read the entire file into memory; memory must be bounded by N (and one line), not by file size.
  4. Newline conventions. Correctly handle Unix LF ( \n ), Windows CRLF ( \r\n ), and classic Mac CR ( \r ) line endings, including a mix.
  5. UTF-8 correctness. Split on line boundaries at the byte level but preserve UTF-8 multibyte characters — never cut a character in the middle when reading or buffering blocks.
  6. Missing trailing newline. A final line not terminated by a newline must still count as a line and be emitted.
  7. Extremely long lines. A single line may exceed your buffer/block size; handle lines larger than the read buffer without corruption.
  8. State the time and space complexity of each approach and give the key test cases you would write.

Follow-ups the interviewer will push on:

  • Extend to tail -f : follow the file as it grows, emitting new lines as they are appended.
  • Handle log rotation / truncation and symlinks (detect when the file is replaced or truncated and re-open / reset position).
  • Support the -c N option: print the last N bytes instead of the last N lines.

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Coding & Algorithms•More Confluent•More Software Engineer•Confluent Software Engineer•Confluent Coding & Algorithms•Software Engineer Coding & Algorithms
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.