Suppose you have $n$ nonequal sets $S_1, ldots, S_n$ and some constant $0 le k < n$. The goal of *set clustering* is to find a partition of the set $mathbf{S} = {S_1, ldots, S_n}$ such that the sum of the *total distance* for each subset of $mathbf{S}$ is minimized and such that $mathrm{cardinality}(mathbf{S}) = k$ (in reality, this latter constraint is not quite so tight, but the size of $mathbf{S}$ must be less than $k$, and hopefully near it). The *total distance* $d$ of a set $X in mathbf{S}$ is $d(X) = sum_{A, B in X} mathrm{cardinality}(A ominus B)$ where $ominus$ is symmetric difference. Assume for the purposes of the problem that the sets consume $O(1)$ space (so symmetric difference can be computed in constant time).

Is there a good greedy linear(ithmic) heuristic for this problem? Is there any literature on this problem, or similar ones?

All I’ve come up with so far is an $O(n^2 log n)$ heuristic that looks like:

- Set $I = {1, ldots, n}$
- Choose some $i in I$
- Emit a cluster $C subset I$ containing all $m in I$ such that $mathrm{cardinality}(S_i ominus S_m) le d_mathrm{max}$.
- Set $I = I – C$
- Go to step 2.

where this process occurs in each step of a binary search that finds the best value of $d_mathrm{max}$ for a given $k$.

I was thinking that if there were some way to sort the list of sets such that nearby sets in the list have small symmetric difference, then a linearithmic solution might be easy to write.