## Problem
You are given an array of strings `words`. You may choose **any subset** of these strings and concatenate the chosen strings in any order.
Your goal is to produce a concatenated string that contains the **maximum possible number of unique characters**, subject to the constraint that **no character appears more than once** in the concatenation.
Return **any valid concatenated string** (or equivalently, any subset) that achieves the maximum number of unique characters.
### Input
- `words: string[]`
### Output
- A string `s` such that:
- every character in `s` is unique (no character repeats), and
- `|s|` is maximized among all valid choices.
### Examples
- `words = ["un", "iq", "ue"]`
- One optimal answer: `"uniq"` (length 4). You cannot use all three, because `"un"` and `"ue"` both contain `u`.
- `words = ["aa", "bc", "d"]`
- `"aa"` cannot be used (it already repeats `a`), so an optimal answer is `"bcd"` (length 3).
```hint What makes a word usable
A word can only ever appear in the answer if it has **no internal duplicate characters** on its own. `"aa"` is invalid before you even combine it with anything. Filter these out first.
```
```hint Reframing the choice
Concatenation order never changes the *set* of characters produced. So the answer only depends on **which** words you pick — and a valid pick is a group of words whose character sets are **pairwise disjoint**.
```
### Clarifying Questions to Ask
- What is the **character domain** — guaranteed lowercase English letters only, or can words contain digits, uppercase, punctuation, or Unicode? (This decides whether a 26-bit mask is safe.)
- How large can `words.length` and the **total character count** get? (Determines whether exhaustive subset search is acceptable or whether a smarter prune is required.)
- Can `words` contain **empty strings** or **duplicate words**, and how should those be treated?
- If multiple subsets tie for the maximum, is **any** one acceptable, or is there a tie-break rule (e.g. lexicographically smallest)?
- Do I need to return the **string itself**, or just its length / the chosen indices?
### Constraints & Assumptions
- `1 <= words.length <= 16` (small enough that exhaustively considering every subset is on the table; confirm the bound).
- Total characters across all words is moderate (e.g. `<= 200`).
- Characters are lowercase English letters **unless stated otherwise** — call this assumption out, since the representation depends on it.
- Any valid maximum-length answer is accepted; concatenation order among chosen words is free.
```hint Sizing the search
Once a word is reduced to "use it or don't," nothing in between, ask yourself how many distinct ways there are to make that decision across all the words — and whether that count is small enough at the given `words.length` that you can afford to consider every combination. Let the size of that space, plus the disjointness condition from the previous hint, point you at how to organize the exploration and how to represent a word's characters so that checking compatibility against what you've already committed to is cheap.
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- How does your representation and runtime change if the alphabet is **not** lowercase letters (digits, uppercase, mixed case, or arbitrary Unicode)? What exactly breaks, and how do you fix it without changing the algorithm?
- A test harness reports that one **large dataset stalls** under your approach. How would you diagnose whether it's correct-but-slow versus stuck, and what would you tune to make the pruning fire harder?
- You're handed an existing **scoring/counting helper that returns a wrong number** (no crash). How would you localize the bug with a minimal reproducing input and an oracle?
- How would you reformulate this as an **iterative subset DP** instead of recursion, and what are the memory trade-offs versus the pruned recursive search?
Quick Answer: This question evaluates understanding of combinatorial search and string manipulation, specifically the ability to reason about unique-character constraints, subset selection, and set-like representations.
Solution
### Approach
First, restate the rules precisely, because two subtleties decide the whole solution:
1. **A word with an internal duplicate is unusable on its own.** `"aa"` can never be part of any valid answer, so it is discarded up front.
2. **You pick a *subset* of words and concatenate.** Concatenation order does not affect the set of characters, so we only care about *which* words are chosen and whether their character sets are pairwise disjoint.
That reduces the problem to: **choose a collection of words whose character sets are mutually disjoint, maximizing the total number of distinct characters.** Each word is "all or nothing" — you can't take part of a word.
With `words.length <= 16`, exhaustive subset search ($2^{16} = 65536$ subsets) is fully feasible. We make it fast with **bitmask backtracking and pruning**.
### Key idea: represent each word as a bitmask
If characters are lowercase English letters, each word becomes a 26-bit integer mask:
- A word has an internal duplicate **iff** `popcount(mask) != len(word)`. Drop those words.
- Two words are combinable **iff** `maskA & maskB == 0` (disjoint character sets).
- The score of a chosen set is `popcount(OR of all chosen masks)`, which equals the total length of the concatenation (since everything is unique).
So we never manipulate strings during the search — just integers and bitwise AND/OR.
### Backtracking solution
We walk the words in order. At each word we branch: **skip it**, or **take it if its mask is disjoint from what we've accumulated.** Track both the running mask and the running list of chosen words so we can rebuild the answer string at the end.
```python
def max_unique_concat(words):
# 1) Pre-filter: keep only words with no internal duplicate,
# storing (mask, word). 'ok and mask' also drops empty strings (mask 0).
cand = []
for w in words:
mask = 0
ok = True
for ch in w:
bit = 1 << (ord(ch) - ord('a'))
if mask & bit: # internal duplicate -> word unusable
ok = False
break
mask |= bit
if ok and mask: # skip empty strings too
cand.append((mask, w))
best = {"len": 0, "words": []}
def popcount(x):
return bin(x).count("1")
# Suffix OR of remaining masks, for an optimistic upper bound.
n = len(cand)
suffix_or = [0] * (n + 1)
for i in range(n - 1, -1, -1):
suffix_or[i] = suffix_or[i + 1] | cand[i][0]
def dfs(i, cur_mask, chosen):
# Prune: even if we took every disjoint bit remaining, can we beat best?
upper = popcount(cur_mask | suffix_or[i])
if upper <= best["len"]:
return
if i == n:
score = popcount(cur_mask)
if score > best["len"]:
best["len"] = score
best["words"] = list(chosen)
return
mask, word = cand[i]
# Branch 1: take this word (only if disjoint)
if cur_mask & mask == 0:
chosen.append(word)
dfs(i + 1, cur_mask | mask, chosen)
chosen.pop()
# Branch 2: skip this word
dfs(i + 1, cur_mask, chosen)
dfs(0, 0, [])
return "".join(best["words"])
```
The function returns a concatenated string with all-unique characters of maximum length, as required (concatenation order is arbitrary, so any order of `best["words"]` is a valid answer).
**Worked check on `["un","iq","ue"]`:** masks are `{u,n}`, `{i,q}`, `{u,e}`. `{u,n}` and `{u,e}` collide on `u`, so we can't take all three. Best disjoint set is `{u,n}` + `{i,q}` = 4 unique chars → e.g. `"uniq"`. On `["aa","bc","d"]`, `"aa"` is filtered out; `{b,c}` and `{d}` are disjoint → `"bcd"`, length 3.
### Why the pruning matters
The `upper <= best["len"]` check is the workhorse. `cur_mask | suffix_or[i]` is an **optimistic bound**: the most distinct characters still reachable if every remaining word were freely usable. If that can't beat the current best, the entire subtree is cut. Combined with discarding internal-duplicate words early, this turns the worst-case $O(2^n)$ search into something that finishes near-instantly for `n <= 16`.
### Complexity
- **Pre-filtering:** $O(L)$ where $L$ is total characters across all words.
- **Search:** worst case $O(2^n \cdot n / w)$ for the bitmask ops, but pruning makes the practical cost far lower. With `n <= 16` it is trivially fast.
- **Space:** $O(n)$ for recursion depth plus the suffix-OR array.
### Sub-problem 1: fixing a buggy counting method
The interview begins not with the solver but with an existing helper whose "counting" is wrong, and you are asked to find and fix the bug. Two failure modes show up over and over, and both produce a *plausible but wrong* number rather than a crash, which is exactly what makes them annoying:
1. **Counting total length instead of unique characters.** If the scorer does `sum(len(w) for w in chosen)` (or `len("".join(chosen))`) it counts characters, not *distinct* characters. The moment two chosen words share a letter the count overshoots the true number of unique characters. The fix is to score against a set: `len(set("".join(chosen)))`, or — equivalently and faster — `popcount(OR of masks)` as above.
2. **Not rejecting internal-duplicate words before counting.** A word like `"aa"` is invalid on its own, but a naive counter that adds `len("aa") == 2` to the running total double-counts the `a` and inflates the score. The fix is the same pre-filter used throughout this solution: keep a word only when `len(set(w)) == len(w)`.
How to *find* such a bug quickly: feed the method a tiny input whose answer you can compute by hand (`["ab","ba"]` should score 2, not 4; `["aa"]` should score 0, not 2) and compare. A one-line assertion against a brute-force reference on small inputs localizes it immediately:
```python
from itertools import combinations
def brute_best_len(words):
valid = [w for w in words if len(set(w)) == len(w) and w]
best = 0
for r in range(len(valid) + 1):
for combo in combinations(valid, r):
s = "".join(combo)
if len(set(s)) == len(s): # the whole concat must stay unique
best = max(best, len(s))
return best
```
`brute_best_len` is correct (if slow) and is the oracle you check the fixed method against.
### Generalizing the alphabet (the large "numbers" dataset)
The 26-bit trick **only works for a fixed small alphabet of lowercase letters.** Interview test suites for this problem often include datasets over digits, uppercase, mixed case, or Unicode — where a single `int` mask is no longer the right representation. The fix is to keep the same algorithm but swap the state type:
- Represent each word's character set as a **`frozenset`** (and reject internal duplicates with `len(set(w)) == len(w)`).
- Disjointness: `not (curset & wordset)` (or `curset.isdisjoint(wordset)`).
- Union: `curset | wordset`; score is `len(curset)`.
```python
def max_unique_concat_general(words):
cand = []
for w in words:
s = set(w)
if len(s) == len(w) and s: # no internal duplicate, non-empty
cand.append((frozenset(s), w))
n = len(cand)
suffix_union = [frozenset()] * (n + 1)
for i in range(n - 1, -1, -1):
suffix_union[i] = suffix_union[i + 1] | cand[i][0]
best = {"len": 0, "words": []}
def dfs(i, cur, chosen):
if len(cur | suffix_union[i]) <= best["len"]:
return
if i == n:
if len(cur) > best["len"]:
best["len"] = len(cur)
best["words"] = list(chosen)
return
s, word = cand[i]
if cur.isdisjoint(s):
chosen.append(word)
dfs(i + 1, cur | s, chosen)
chosen.pop()
dfs(i + 1, cur, chosen)
dfs(0, frozenset(), [])
return "".join(best["words"])
```
This keeps the exact same prune (suffix-union upper bound) but works for any character domain. It is the version you want when "the numbers dataset doesn't pass."
**What actually goes wrong with the bitmask on a non-lowercase dataset — stated precisely.** The bitmask does *not* silently produce a wrong-but-running answer, and it cannot "alias" two characters onto the same bit. `ord(ch) - ord('a')` is a *bijection* from codepoints to shift amounts, so two distinct characters always land on distinct bits — collisions are impossible by construction. The real failure modes are:
- **A loud crash on any character below `'a'`** (digits `'0'`–`'9'`, uppercase `'A'`–`'Z'`, and punctuation like `' '` or `'-'`). For those, `ord(ch) - ord('a')` is *negative*, and `1 << (negative)` raises `ValueError: negative shift count` in Python. The "numbers dataset" trips this on the first digit — the run aborts with a traceback rather than returning a wrong answer.
- **A fragile, non-portable (but still arithmetically correct) result for characters above `'z'`.** Python integers are arbitrary precision, so a codepoint above `'z'` simply sets a higher bit and the algorithm still computes the right answer; the trouble is only that you have silently left "26-bit lowercase" land, and any fixed-width-int port of the same code (C/Java `long`) would overflow or wrap. So the lowercase-bitmask is unsafe to *rely on* outside its stated domain.
The state-as-set form removes both concerns: it never indexes by codepoint, so there is nothing to crash and no width assumption to outgrow.
### Debugging a backtracker that stalls on a big dataset
The interview note specifically asked how to debug this. Practical levers, in order:
- **Cap the search and instrument it.** Add a counter of `dfs` calls; if it explodes, your pruning isn't firing. Print `best["len"]` over time to see whether it plateaus (correct but slow) vs. never improves (likely a bug).
- **Verify the pre-filter.** A common bug is forgetting to reject internal-duplicate words, which both inflates work and can produce invalid answers.
- **Check for an exception swallowing the run.** On a digits/uppercase dataset the *lowercase bitmask* aborts with `ValueError: negative shift count` on the first such character — if a harness catches and reports that as "didn't pass," switch to the `frozenset` state before assuming the algorithm is too slow.
- **Strengthen ordering.** Sort candidate words by descending set size before the search so high-value words are tried first; `best` rises faster and the upper-bound prune cuts more subtrees.
- **Deduplicate masks.** Identical character sets are redundant; keeping one representative shrinks the branching factor.
### Alternative: subset DP
If you prefer iteration over recursion, build the set of all reachable disjoint-character bitmasks (or sets) incrementally:
```python
def max_unique_concat_dp(words):
masks = []
for w in words:
s = set(w)
if len(s) == len(w) and s: # no internal duplicate, non-empty
masks.append((frozenset(s), w))
# reachable: map from accumulated frozenset -> list of words producing it
reachable = {frozenset(): []}
for s, word in masks:
for cur in list(reachable.keys()):
if cur.isdisjoint(s):
new = cur | s
if new not in reachable:
reachable[new] = reachable[cur] + [word]
best = max(reachable, key=len)
return "".join(reachable[best])
```
This is the standard LeetCode 1239 DP. The `and s` non-empty guard keeps empty strings out of `masks`, matching the filtering the other variants do. It's clean but can hold many states in memory; the pruned backtracker is usually faster and lighter when `n` is up to ~16. Both return a maximum-length all-unique concatenation.
### Edge cases
- **Empty input or all words invalid** (every word has an internal duplicate): the answer is the empty string `""`, length 0.
- **Empty strings inside `words`:** contribute nothing; filtered out in every variant (mask `0` → dropped by `ok and mask`; empty set → dropped by the `and s` guard).
- **Single best word vs. combination:** the optimal answer may be one long word rather than several short ones — the search compares both because skipping is always a branch.
- **Ties:** many subsets can reach the maximum; any one is accepted per the problem statement, so returning the first best found is fine.
- **Mixed/large alphabet:** use the `frozenset` version; do not assume lowercase-only unless the problem guarantees it. The lowercase bitmask will crash with `ValueError: negative shift count` on any character below `'a'`.