*This is long, but I hope it is worth reading for people interested in combinatorics, and it might prove important to Covid-19 testing.*

**0. Introduction**

The starting point of this question (with multiple subquestions; let me know if this should be made Community Wiki) is this important article by Mutesa et al., where a hypercube $\{0,1,2\}^n$ is used for pooling takings for Covid-19 testing (pools are coordinate hyperplanes). This tests $3^n$ takings with $3n$ tests, and identifies exactly who is positive as long as no more than $1$ among the $3^n$ is. There, $3$ is shown to be optimal in terms of compression of information, and there are extra strategies for the case when $2$ or more positives show up. However, this limitation on the subsets of positives that can be characterized makes this pooling design usable only at low prevalence. Moreover, even at low prevalence it could be that completely different designs would prove more efficient.

I have written a draft sketching some possible research directions, and I would like to share here the main point and ask what seem to me to be the main questions. It might be better to set up a Polymath project, but I don't feel I have the skills (I am not a combinatorialist myself) nor the network to make it work.

**1. Definitions**

Assume we are given $N$ takings from patients to be analyzed for presence of a virus by RT-PCR. An important feature of RT-PCR in this respect is its low sensitivity to dilution: up to some extent, and up to the usual limits in sensitivity and specificity, a pool where several takings are mixed will be positive if and only if at least one of the patients has the disease. The simplest pooling method is to gather (fractions of) takings in pools of $k$, and analyze each pool. If the result is negative, all takings are negative; if the result is positive, the takings are reanalyzed individually. One is left with optimizing $k$ using an estimation of the prevalence in the patients. The issue with this method is that successive rounds increase the time to obtain the result, so ideally one would like to parallelize the tests as much as possible, ideally in only one round.

Recall a *hypergraph* is a pair $(V,E)$ where $V$ is a set (whose elements are called vertices) and $E$ is a set of non-empty subsets of $V$ (whose elements are called edges).

**Definition.** Given a vertex $x\in V$, let $x^*$ be the set of edges containing $x$. Given a subset $X\subset V$ of vertices, let $X^*=\{e\in E \mid \exists x\in X,\ x\in e\}$ be the set of all edges incident to some element of $X$.

Let us define a *pooling design* as a hypergraph $(V,E)$ satisfying the following property:

$$\forall x\in V,\ \forall X\subset V, \quad x^* = X^* \implies X=\{x\}$$

The interpretation is as follows: each vertex corresponds to a patient (or rather their taking), and each edge to a pool where fractions of takings will be mixed to be analyzed together. The condition that $x^*=X^*$ should only occur when $X=\{x\}$ ensures that, whenever there is at most one positive taking, its uniqueness is guaranteed by the tests and it can be identified.

Given a pooling design $(V,E)$, there are several numbers of importance to analyze its practical interest in pooled testing. Among them we mention:

- its *order* $v=\lvert V\rvert$, i.e. the number of patients considered in one application of the design (to analyze $N$ takings, we need $v\ge N$, possibly filling some vertices with empty takings, or we divide the $N$ takings into several groups of at most $v$ patients),
- its *size* $e=\lvert E\rvert$, i.e. the number of RT-PCR involved in one application of the design,
- its *compression rate* $r=\frac{e}{v}$, i.e. the factor applied to the number of RT-PCR compared to individual testing (the smaller, the better),
- its *detection capacity* $c$, i.e. the maximal number of positive takings that can be guaranteed and identified. Formally, letting $\mathcal{P}_{\le n}(V)$ be the set of subsets of $V$ with at most $n$ elements, we set

$$c = \max\big\{n \colon \forall X,Y\in \mathcal{P}_{\le n}(V),\ X^*=Y^*\implies X=Y\big\}.$$

The definition of a pooling design ensures $c\ge 1$, but larger is better.
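For small designs, the detection capacity can be computed by brute force directly from this definition; here is a minimal sketch (the function names are mine):

```python
from itertools import combinations

# Brute-force computation of the detection capacity c of a small hypergraph,
# directly from the definition above. Edges are frozensets of vertices, and
# star(X) is X*, encoded as the set of indices of edges meeting X.

def star(X, edges):
    return frozenset(j for j, e in enumerate(edges) if any(x in e for x in X))

def detection_capacity(V, edges):
    # Grow n as long as X -> X* stays injective on subsets of size <= n.
    stars = {star((), edges): ()}
    c = 0
    for n in range(1, len(V) + 1):
        for X in combinations(V, n):
            s = star(X, edges)
            if s in stars:      # two distinct subsets of size <= n collide,
                return c        # so the capacity is n - 1
            stars[s] = X
        c = n
    return c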

**2. The compression rate is limited by the relative detection capacity**

**Proposition.**

Let $(V,E)$ be a pooling design of order $v$, size $e$ and detection capacity $c$. Then

$$e \ge v\, H\Big(\frac{c}{v}\Big)-\frac12\log_2(c)-\frac32$$

where $H(x) := -x\log_2(x)-(1-x)\log_2(1-x)$ is the binary entropy function.

In particular, the compression rate satisfies

$$r \ge H\Big(\frac{c}{v}\Big) - o_{v\to\infty}(1).$$

Since the prevalence at which the pooling design can be used is mostly driven by $c/v$, this gives a quantitative estimate of how much a large prevalence prevents a good compression rate.

The proof is straightforward, and sketched in the draft.
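The counting step behind the proposition (the $2^e$ outcome patterns must distinguish the $\sum_{i\le c}\binom{v}{i}$ candidate sets of positives) can be checked numerically; a sketch, assuming the standard estimate $\binom{v}{c}\ge 2^{vH(c/v)}/\sqrt{8c(1-c/v)}$ (function names are mine):

```python
import math

# Compare the exact information-theoretic lower bound on e, namely
# log2(#subsets of size <= c), with its entropy approximation
# v*H(c/v) - log2(c)/2 - 3/2 from the proposition.

def H(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def exact_bound(v, c):
    # e must satisfy 2^e >= number of candidate positive sets of size <= c.
    return math.log2(sum(math.comb(v, i) for i in range(c + 1)))

def entropy_bound(v, c):
    return v * H(c / v) - 0.5 * math.log2(c) - 1.5

# Example: the hypercube design with D = 5 has v = 243, e = 15, c = 1;
# e = 15 comfortably exceeds both bounds (roughly 7.9 here).
print(entropy_bound(243, 1), exact_bound(243, 1))
```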

**3. Examples**

**Example 1.** Individual testing consists in taking $V$ the set of all $N$ takings, and $E=\big\{\{x\} \colon x\in V\big\}$: each edge is a single vertex. Then we have

\begin{align*}
v &= e = N & r &= 1 & c &= N
\end{align*}

**Example 2.** The hypercube design of (Mutesa et al. 2020) with dimension $D\ge2$ consists in taking $V=\{1,2,3\}^D$ and $E$ the set of coordinate slices, i.e.

$$E=\bigcup_{k=1}^D \big\{\{1,2,3\}^{k-1}\times \{i\}\times\{1,2,3\}^{D-k} \colon i\in\{1,2,3\}\big\}.$$

It has

\begin{align*}
v &= 3^D & e &= 3D & r &= \frac{D}{3^{D-1}} & c &= 1
\end{align*}

The detection capacity has the good property of not depending on a prior assumption on the number of positives (if there are more than one, the design detects this event), and while for large $D$ the detection capacity is low compared to the order, subsequent rounds of testing can be used efficiently. The main question we want to raise is whether it is possible to improve on this state-of-the-art pooling design.
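That $c=1$ (and not more) can be seen concretely on the smallest case $D=2$: two "diagonal" pairs of positives trigger exactly the same pools. A minimal sketch (variable names are mine):

```python
from itertools import product

# The hypercube design for D = 2 over {1,2,3}: v = 9 vertices, e = 6 pools
# (one pool per value of each coordinate).

D = 2
V = list(product([1, 2, 3], repeat=D))
E = [frozenset(x for x in V if x[k] == i) for k in range(D) for i in (1, 2, 3)]

def star(X):
    return frozenset(j for j, e in enumerate(E) if any(x in e for x in X))

# Singletons all have distinct stars, so a unique positive is identified.
print(len({star([x]) for x in V}) == len(V))            # True

# But these two disjoint pairs trigger exactly the same pools, so c = 1.
print(star([(1, 1), (2, 2)]) == star([(1, 2), (2, 1)]))  # True
```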

Comparing $H(c/v)$ with the actual compression rate of the hypercube design for various values of $D$ shows some limited room for improvement (see the draft): the hypercube is off by only a factor less than $2$; these pooling designs are thus not too far from optimal in their prevalence regime.

**Example 3.** The complete quadrilateral can be described with $V=\{1,2,3,4,5,6\}$ and $E=\big\{ \{1,2,3\}, \{3,4,5\}, \{5,6,2\}, \{1,4,6\} \big\}$. It has

\begin{align*}
v &= 6 & e &= 4 & r &= \frac23 & c &= 1
\end{align*}

For comparison, we note that $H(c/v) \simeq 0.65$, very close to the compression rate: this pooling design is close to optimal in its prevalence regime (although the small $v$ makes it more likely that more takings than expected turn out positive; concretely, the probability of having more than $1$ positive taking drops below $5\%$ at prevalence $\le 6.28\%$).

Other examples from incidence geometry are given in the draft.

**Example 4.**

Let $p$ be a prime number (or a prime power) and $\mathbb{F}_p$ be the field with $p$ elements. Choose a dimension $D\ge 2$ and a parameter $k\ge D$. We set $V=\mathbb{F}_p^D$ (for $p=3$, we thus have the same vertex set as in the hypercube design). Let $(\phi_1,\dots,\phi_k)$ be linear forms such that any $D$ of them are linearly independent. Without loss of generality, we can assume $(\phi_1,\dots,\phi_D)$ is the canonical dual basis (i.e. $\phi_i(x_1,\dots,x_D) = x_i$). Last, we let $E$ be the set of all level sets of all the $\phi_i$:

$$ E = \big\{\phi_i^{-1}(y) \colon i\in\{1,\dots, k\},\ y\in\mathbb{F}_p \big\}.$$

Let us call the pooling design $(V,E)$ the *generalized hypercube design* of parameters $(p,D,k)$. It has

\begin{align*}
v &= p^D & e &= kp & r &= \frac{k}{p^{D-1}}
\end{align*}

and the remaining question is how large $c$ can be. I think that for $k=D+1$, $c\ge2$.

For example, with $p=3$, $D=5$ and $k=6$ this gives $H(c/v)\ge 0.068$ and $r\simeq 0.074$, a relative improvement on the hypercube design of the same dimension. With $p=2$, $D=5$ and $k=6$, we get a moderate-order pooling design ($v=32$) with $H(c/v)\ge 0.33$ and nearly optimal $r= 0.375$, which can be used up to prevalence $2.6\%$ with less than $5\%$ probability of exceeding the detection capacity.
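For what it is worth, the guess $c\ge2$ for $k=D+1$ can at least be brute-forced on the smallest instance $p=3$, $D=2$, $k=3$, taking the forms $x$, $y$ and $x+y$ (any two of which are linearly independent). A sketch under these assumptions (function names are mine):

```python
from itertools import combinations, product

# Generalized hypercube design with p = 3, D = 2, k = 3: vertices are F_3^2,
# pools are the level sets of the forms x, y and x + y (9 pools in total).

p, D = 3, 2
V = list(product(range(p), repeat=D))
forms = [lambda x: x[0], lambda x: x[1], lambda x: (x[0] + x[1]) % p]
E = [frozenset(v for v in V if f(v) == y) for f in forms for y in range(p)]

def star(X):
    return frozenset(j for j, e in enumerate(E) if any(x in e for x in X))

def detection_capacity():
    # Grow n as long as X -> X* stays injective on subsets of size <= n.
    stars = {star(()): ()}
    for n in range(1, len(V) + 1):
        for X in combinations(V, n):
            s = star(X)
            if s in stars:
                return n - 1
            stars[s] = X
    return len(V)

print(detection_capacity())  # prints 2
```

So $c=2$ here: all pairs of positives are separated, but two distinct triples (e.g. the diagonals $\{(0,0),(1,1),(2,2)\}$ and $\{(0,1),(1,2),(2,0)\}$) both trigger all $9$ pools. This supports, but of course does not prove, the conjecture for general $(p,D)$.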

**4. Questions**

**General Question** Which values of $v,r,c$ are realized by a pooling design?

**Question 1.** Determine $c$ for the generalized hypercube design; in particular, confirm that $c\ge 2$ for $k>D$ (it might be that $c$ depends on the specific linear forms chosen, although I would bet a low stake that it does not). Given $v_0$, what choice of $p,D,k$ such that $v\simeq v_0$ minimizes $\frac{r}{H(c/v)}$? Given a prevalence, what is the best value of $r$ that can be achieved with a generalized hypercube for which the detection capacity is exceeded with probability less than $5\%$?

**Question 2.** Do there exist pooling designs with $v\gg 1$, $c/v \simeq 1/6$ and compression rate $\simeq 2/3$?

**Question 3.** For small values of $v$, give all pooling designs that are optimal in the sense that no other pooling design with the same order has both a better compression rate and a better detection capacity.