A plausible-sounding data structure that probably doesn’t exist

Here is a description of an abstract data type: a confluently persistent set data type containing a subset of the integers in a range \([1..n]\) and supporting \(O(\log(n)^c)\) time creation, \(O(\log(n)^c)\) time size query, \(O(n \log(n)^c)\) time conversion to a duplicate-free list, and \(O(\log(n)^c)\) time merging, for some constant \(c\). Here, “creation” takes as arguments the range of the element domain \([1..n]\) and an optional single element in this range, and “merging” takes two sets with the same domain and returns a new set representing the union of the two input sets. In Haskell, the type signatures for these operations might look like this:
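A sketch of that interface as a Haskell type class (the class and method names here are my invention, not from any real library):

```haskell
-- A hypothetical interface; "MergeableSet" and the method names are illustrative.
class MergeableSet s where
  -- create n mx: a set over the domain [1..n], containing mx's element if present
  create :: Int -> Maybe Int -> s
  -- number of elements, in O(log(n)^c) time
  size   :: s -> Int
  -- duplicate-free list of elements, in O(n log(n)^c) time
  toList :: s -> [Int]
  -- union of two sets over the same domain, in O(log(n)^c) time
  merge  :: s -> s -> s
```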

Seems fairly reasonable, right? I’m going to show that it is likely that no such data structure exists.

First, note that some very similar data structures do in fact exist. Haskell’s Data.Set can be used to implement this interface with \(O(1)\) singleton and size, \(O(\log(n))\) membership testing (which is obviously much more powerful than toList), and \(O(n)\) union. Brodal, Makris, and Tsichlas (2006) presented a purely functional data structure that has \(O(1)\) singleton, \(O(\log(n))\) membership testing, and \(O(1)\) “join”, which is the same as union but requires every element in the first set to be strictly less than every element in the second set.
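The Data.Set near-miss can be sketched like this (a standalone illustration, with names of my choosing; everything meets the hypothetical bounds except the linear-time union):

```haskell
import qualified Data.Set as S

-- Data.Set-backed near-miss: fast creation, size, and membership,
-- but union is linear rather than polylogarithmic.
newtype NaiveSet = NaiveSet (S.Set Int)

create :: Int -> Maybe Int -> NaiveSet
create _n Nothing  = NaiveSet S.empty          -- O(1)
create _n (Just x) = NaiveSet (S.singleton x)  -- O(1)

size :: NaiveSet -> Int
size (NaiveSet s) = S.size s                   -- O(1)

member :: Int -> NaiveSet -> Bool
member x (NaiveSet s) = S.member x s           -- O(log n)

toList :: NaiveSet -> [Int]
toList (NaiveSet s) = S.toAscList s            -- O(n), duplicate-free

merge :: NaiveSet -> NaiveSet -> NaiveSet
merge (NaiveSet a) (NaiveSet b) = NaiveSet (S.union a b)  -- O(n): the bottleneck
```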

So why is the variant above so implausible?

Argument 1: Fast transitive closure

If a MergeableSet data structure with the given time bounds exists (even without the size operation), then it is possible to find the transitive closure of an \(n\)-vertex graph in near-optimal time \(O(n^2 \log(n)^c)\).

The fastest known worst-case algorithms for computing the transitive closure of a graph are based on algorithms for fast matrix multiplication. In particular, the transitive closure of an \(n\)-vertex graph can be computed in time \(O(n^\omega)\) where \(\omega < 2.373\) is the best known exponent for matrix multiplication. A faster algorithm for transitive closure would actually give a faster algorithm for Boolean matrix multiplication as well, as noted by Fischer and Meyer (1971).

Now, the problem of finding the transitive closure of a general graph can be reduced to the problem of finding the transitive closure of a directed acyclic graph. We can just compute the strongly connected components of the graph using any of the several linear-time algorithms, then compute the transitive closure of the resulting kernel DAG. Looping over the pairs of vertices in the original graph to move back to the starting domain takes \(O(n^2)\) time, but since the size of the output is \(n^2\) bits anyway there’s no additional asymptotic cost.

Suppose then that MergeableSet exists and we want to find the transitive closure of a DAG. We can associate to each vertex the set of vertices reachable from that vertex, stored as a MergeableSet. By traversing the graph in reverse topological order and using union to combine the sets of all of the vertices adjacent to each vertex, we can compute MergeableSets of reachable vertices for all vertices in \(O(n^2 \log(n)^d)\) time, for some constant \(d\). Then we just loop over all \(n\) vertices and obtain their lists of reachable vertices using toList, which also takes \(O(n^2 \log(n)^d)\) time. A Haskell implementation of this idea (adding a slight \(O(\log(n))\) overhead by using Data.Map so that I don’t have to get mutable arrays involved) looks like this:
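Here is one sketch of that computation, with Data.Set standing in for the hypothetical MergeableSet (so the merges here are slow; with genuine polylogarithmic merging the whole thing would run in the stated time):

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Transitive closure of a DAG.  The vertex list must be in reverse
-- topological order: for every edge u -> v, vertex v appears before u,
-- so each vertex's successors are finished before we reach it.
transitiveClosure :: [Int] -> M.Map Int [Int] -> M.Map Int [Int]
transitiveClosure revTopo g = M.map S.toAscList reach  -- the final toList pass
  where
    reach = foldl step M.empty revTopo
    step m v =
      let succs = M.findWithDefault [] v g
          -- v reaches itself plus everything its successors reach
          rs = S.unions (S.singleton v : map (m M.!) succs)
      in M.insert v rs m
```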

Argument 2: Faster Boolean satisfiability

If a MergeableSet data structure with the given time bounds exists (even without the toList operation), then Cnf-Sat, the Boolean satisfiability problem for formulas in conjunctive normal form, has a \(2^{\delta n} \cdot \text{poly}(m)\) algorithm for some \(\delta < 1\).

Pătrașcu and Williams (2010) gave several hypotheses under which Cnf-Sat would have substantially faster algorithms than brute-force search. One of their theorems is as follows: if a certain problem 2Sat+2Clauses can be solved in time \(O((n + m)^{2 - \epsilon})\) for any \(\epsilon > 0\), then Cnf-Sat with \(n\) variables and \(m\) clauses can be solved in time \(2^{\delta n} \cdot \text{poly}(m)\) for some \(\delta < 1\). They note in passing that 2Sat+2Clauses reduces in linear time to the following problem:

Given a directed graph \(G = (V, E)\) and subsets \(S, T \subseteq V\), determine if there is some \(s \in S\) and \(t \in T\) with no path from \(s\) to \(t\).

By computing the strongly-connected components of \(G\), we can again without loss of generality assume that \(G\) is acyclic.

Now suppose that MergeableSet exists. Then it is possible to solve this problem in time \(O((n + m) \cdot \log(n)^c)\) for a graph with \(n\) vertices and \(m\) edges. First, we compute the set of vertices in \(T\) reachable from each vertex, using essentially the same algorithm as the one for transitive closure from before. Then we loop over each vertex in \(S\) and use size to test whether the size of its reachable set is less than \(|T|\). If we find a vertex \(s\) where this is the case, then return true; otherwise, return false. (We can also find a specific vertex \(t\) with no path from \(s\) to \(t\) by depth-first search from \(s\).)
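The steps above can be sketched as follows, with Data.Set again standing in for the hypothetical MergeableSet (the function name and argument conventions are my own; with polylog merge and size this would run in the stated \(O((n + m) \log(n)^c)\) time):

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Is there some s in srcs and t in targets with no path s ~> t?
-- reach v holds the targets-vertices reachable from v.  Vertices must
-- be given in reverse topological order (the input DAG's sinks first).
missingPair :: [Int] -> M.Map Int [Int] -> S.Set Int -> S.Set Int -> Bool
missingPair revTopo g srcs targets =
    -- s misses some t exactly when its reachable-targets set is too small
    any (\s -> S.size (reach M.! s) < S.size targets) (S.toList srcs)
  where
    reach = foldl step M.empty revTopo
    step m v =
      let self  = if v `S.member` targets then S.singleton v else S.empty
          succs = M.findWithDefault [] v g
      in M.insert v (S.unions (self : map (m M.!) succs)) m
```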

So, to summarize, MergeableSet would dramatically improve upon the known upper bounds for graph reachability problems. It’s probably too good to be true.