Suppose we are given an alphabet $\Sigma$ with $|\Sigma| = O(1)$, and consider two fixed strings $A, B \in \Sigma^n$. Define the Hamming distance between these strings as

$$d_{H}(A,B) = \sum_{i=1}^n \boldsymbol{1}\lbrace A(i) \neq B(i)\rbrace$$

If we define $B^{(k)}$ to be the cyclic shift of $B$ by $k$ positions (to the right), then what I am looking to compute is

$$d_{\text{cyc},H}(A,B) = \min_{k \in \lbrace 0, \cdots, n-1 \rbrace} d_H\left(A, B^{(k)}\right)$$

It is easy to see that we can compute $d_H(A,B)$ for length-$n$ strings $A$ and $B$ in $O(n)$ time, which gives a trivial $O(n^2)$ algorithm for $d_{\text{cyc},H}(A,B)$. My goal is to see if we can do better. If someone knows of an algorithm that generalizes to any constant value of $|\Sigma|$, I would be happy to know. For now, I will lay out some of my thoughts.
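For concreteness, the trivial quadratic baseline can be sketched as follows (function names are my own, purely illustrative):

```python
def hamming(a: str, b: str) -> int:
    """O(n) Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cyclic_hamming_naive(a: str, b: str) -> int:
    """O(n^2) baseline: minimize d_H over all n cyclic rotations of b."""
    n = len(b)
    # b[k:] + b[:k] enumerates every rotation of b exactly once
    return min(hamming(a, b[k:] + b[:k]) for k in range(n))
```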

Suppose that $|\Sigma| = 2$, say $\Sigma = \lbrace \alpha, \beta \rbrace$. Define a map $h: \Sigma \rightarrow \lbrace -1, 1 \rbrace$ with $h(\alpha) = -1$ and $h(\beta) = 1$. If we transform the strings $A$ and $B$ element-wise into strings $A'$ and $B'$ in $\lbrace -1, 1\rbrace^n$, we can then compute all of the $d_H\left(A, B^{(k)}\right)$ values via an FFT involving the concatenated string $B'B'$ and the string $A'$. To see this, first consider the computation of $d_H(A,B)$. Let $I_{=} \subseteq (n)$ be the set of indices where $A$ and $B$ have the same character, and let $I_{\neq} = (n) \setminus I_{=}$ be the set of indices where they differ. Clearly $I_{=}$ and $I_{\neq}$ are disjoint, so $|I_{=}| + |I_{\neq}| = n$. Now consider the inner product of $A'$ and $B'$. At any index where $A$ and $B$ have the same character, $A'$ and $B'$ have the same sign; at any index where $A$ and $B$ differ, the signs differ as well. Thus we find that

$$(A' \cdot B') = \sum_{i=1}^n A'(i) B'(i) = \sum_{i \in I_=} A'(i) B'(i) + \sum_{i \in I_{\neq}} A'(i) B'(i) = |I_=| - |I_{\neq}|$$

As $d_H(A,B) = |I_{\neq}|$ and $(A' \cdot B') = |I_{=}| - |I_{\neq}| = n - 2 |I_{\neq}|$, it follows that

$$d_H(A,B) = |I_{\neq}| = \frac{1}{2}\left(n - (A' \cdot B')\right)$$
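As a quick sanity check, this identity can be verified numerically; the mapping $\alpha \mapsto -1$, $\beta \mapsto 1$ is hard-coded here with illustrative names:

```python
def to_pm1(s: str, alpha: str = "a") -> list:
    """Map a two-letter alphabet to {-1, +1}: alpha -> -1, other -> +1."""
    return [-1 if c == alpha else 1 for c in s]

def hamming_via_inner_product(a: str, b: str) -> int:
    """d_H(A, B) = (n - <A', B'>) / 2 for the binary alphabet."""
    ap, bp = to_pm1(a), to_pm1(b)
    inner = sum(x * y for x, y in zip(ap, bp))  # (A' . B')
    return (len(a) - inner) // 2
```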

Now let $\text{rev}(S)$ denote the reversal of a length-$m$ string $S$, so that $\text{rev}(S)(i) = S(m+1-i)$. If we define the string $C' = \text{rev}(B'B')$, of length $2n$, we find for any $k \in (n)$ that

\begin{align}
v_k &:= \sum_{i=1}^n C'\left((2n+2-k)-i\right) A'(i)\\
&= \sum_{i=1}^n (B'B')\left((k-1) + i\right) A'(i) \\
&= \sum_{i=1}^n (B')^{(k-1)}(i)\, A'(i) \\
&= \left((B')^{(k-1)} \cdot A'\right) \\
&= n - 2\, d_H\left( A, B^{(k-1)} \right)
\end{align}

This implies that computing the convolution of the strings $C'$ and $A'$ gives us a mechanism to compute all of the values $d_H\left(A, B^{(k)}\right)$, which can be done in $O(n \log(n))$ time using the Fast Fourier Transform (FFT). This sounds great for the special case $|\Sigma| = 2$, but I am unsure about an efficient, exact approach that generalizes to larger constant alphabet sizes.
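A sketch of this approach using NumPy's FFT follows. The zero-padding and rotation conventions are my own choices, and the string reversal from the derivation above is folded into the standard correlation-theorem form rather than materialized as an explicit $\text{rev}$:

```python
import numpy as np

def all_shift_inner_products(ap, bp):
    """Return v with v[k] = sum_i bp[(i + k) mod n] * ap[i], for k = 0..n-1.

    Computed in O(n log n) as a cross-correlation of A' against B'B':
    IFFT(FFT(B'B') * conj(FFT(A'))). The conjugation plays the role of
    the reversal in the derivation above.
    """
    n = len(ap)
    bb = np.concatenate([bp, bp])                        # B'B', length 2n
    f = np.fft.fft(bb) * np.conj(np.fft.fft(ap, 2 * n))  # zero-pad A' to 2n
    v = np.fft.ifft(f).real[:n]
    return np.rint(v).astype(int)                        # entries are +-1 sums

def cyclic_hamming_fft(a, b, alpha="a"):
    """min_k d_H(A, B^{(k)}) via d_H = (n - v_k)/2, for a binary alphabet."""
    to_pm1 = lambda s: np.array([-1 if c == alpha else 1 for c in s], float)
    n = len(a)
    return int((n - all_shift_inner_products(to_pm1(a), to_pm1(b)).max()) // 2)
```

Maximizing $v_k$ corresponds to minimizing $d_H$, since $d_H = (n - v_k)/2$.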

My initial thought for an approximation is to create, say, an $r$-wise independent family of hash functions $\mathcal{H} := \left\lbrace h: \Sigma \rightarrow \lbrace -1, 1 \rbrace \,|\, \forall c \in \Sigma,\ h(c) = 1 \text{ with prob } 1/2\right\rbrace$ for $r$ at least 2, uniformly sample some $h \in \mathcal{H}$, and then for a string $A \in \Sigma^n$ set $A'(i) = h(A(i))$. If we define the random variable $Y(A,B) = A' \cdot B'$ under this type of transformation, we find that

\begin{align}
\mathbb{E}\left(Y(A,B)\right) &= \sum_{i=1}^n \mathbb{E}\left(A'(i)B'(i)\right) \\
&= \sum_{i \in I_{=}} \mathbb{E}\left( A'(i)B'(i)\right) + \sum_{i \in I_{\neq}} \mathbb{E}\left(A'(i)B'(i)\right)
\end{align}

Consider two characters $a, c \in \Sigma$. If $a = c$, then $\mathbb{E}(h(a) h(c)) = \mathbb{E}(h(a)^2) = \mathbb{E}(1) = 1$ since $h(a) = h(c)$. If $a \neq c$, then $\mathbb{E}(h(a) h(c)) = \mathbb{E}(h(a))\, \mathbb{E}(h(c)) = 0$ by pairwise independence. This result implies that

\begin{align}
\mathbb{E}\left(Y(A,B)\right) &= \sum_{i \in I_{=}} \mathbb{E}\left( A'(i)B'(i)\right) + \sum_{i \in I_{\neq}} \mathbb{E}\left(A'(i)B'(i)\right) \\
&= |I_{=}| \\
&= n - |I_{\neq}|
\end{align}

This means that, technically, we could use the unbiased estimator $\hat{d}_H(A,B) = n - Y(A,B)$. We could then average across multiple independent estimates to reduce the variance, but initial calculations seem to show that the variance of a single estimate satisfies $\text{Var}(\hat{d}_H(A,B)) = \Theta(n^2)$ in the worst case. This makes some sense, because certain hash functions can get things completely wrong: if we happen to choose an $h$ such that $h(c) = 1$ for all $c \in \Sigma$, then the estimate will report that the strings are identical even if they agree nowhere, e.g. $A = aaa$ and $B = bbb$. Thus, this randomized approach does not seem sound as is. If anyone has ideas for how it could be modified to improve the concentration properties, I would be happy to hear them!
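For reference, here is a sketch of the estimator as described, using a fully random sign map as a stand-in for a sample from $\mathcal{H}$ and averaging over several independent draws (the function name, trial count, and seeding are all illustrative); it demonstrates the mechanics, not a fix for the variance issue:

```python
import random

def estimate_hamming(a, b, trials=50, seed=0):
    """Average of the unbiased estimates n - Y(A, B) = n - <A', B'>
    over independent random sign maps h: Sigma -> {-1, +1}."""
    rng = random.Random(seed)
    n = len(a)
    alphabet = sorted(set(a) | set(b))
    total = 0
    for _ in range(trials):
        h = {c: rng.choice((-1, 1)) for c in alphabet}  # random sign map
        inner = sum(h[x] * h[y] for x, y in zip(a, b))  # Y(A, B) = A'.B'
        total += n - inner                              # unbiased estimate
    return total / trials
```

Matching positions always contribute $+1$ to the inner product, so identical strings are estimated exactly; the randomness only enters through the mismatched positions.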