Coding Interview PatternsDelete Duplicate Folders in System
HardTrie

Delete Duplicate Folders in System

Explanation & Solution

Description

Due to a bug, there are many duplicate folders in a file system. You are given a 2D array paths, where paths[i] is an array representing an absolute path to the i-th folder in the file system.

For example, ["one", "two", "three"] represents the path /one/two/three.

Two folders (not necessarily at the same level) are identical if they contain the same non-empty set of identical subfolders and underlying subfolder structure. Two folders are considered identical if their serialized subtree structures are the same.

If two or more folders are identical, then mark the folders as well as all their subfolders. Once all duplicate folders have been marked, the file system will delete all of them. The file system only runs the deletion once, so any folders that become identical after the deletion are not deleted.

Return the 2D array ans containing the paths of the remaining folders after deleting all the marked folders. The paths may be returned in any order.

Input: paths = [["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]]

Output: [["d"],["d","a"]]

Explanation: Folders /a and /c both have a subfolder b, making them identical. They are deleted. Folder /d has subfolder a (not b), so it remains.

Constraints

  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 500
  • 1 <= paths[i][j].length <= 10
  • 1 <= sum(paths[i][j].length) <= 2 * 10^5
  • paths[i][j] consists of lowercase English letters
  • No two paths lead to the same folder
  • The paths do not contain a path that is a prefix of another where the longer path has the exact same intermediate components

Approach

Trie pattern

1. Build the Folder Trie

  • Sort paths and insert each path into a trie
  • Each trie node represents a folder and has a children map of subfolder names to child nodes

2. Serialize Each Subtree

  • Use post-order DFS to compute a canonical string for each node's subtree
  • A leaf node (no children) gets an empty string ""
  • A non-leaf node serializes as the sorted concatenation of "(" + childName + childSerialization + ")" for each child
  • This produces a unique string for each unique subtree structure

3. Detect Duplicates

  • Store each non-empty serialization in a hash map pointing to all nodes with that serialization
  • If two or more nodes share the same serialization string, their subtrees are identical
  • Mark all such nodes as deleted

4. Collect Remaining Paths

  • DFS from the root, skipping any node marked deleted
  • For each non-deleted node, record its path in the result

5. Return the Result

  • The result contains all paths that were not part of any duplicate subtree group

Key Insight

  • Subtree serialization converts the structural comparison problem into a string comparison problem
  • Two folders are duplicates if and only if their serialized subtree strings are identical
  • Time complexity is **O(N * L * log(L))** where N is total path components and L accounts for sorting children during serialization

Solution Code