Discovery index

For completeness, we provide the rationale behind the exponential regression function we used to define the discovery index.

A simple urn model

Consider an urn containing N different mutations that can occur in some gene. Each mutation is, in addition, of one of two types: driver or passenger. There are d driver and p passenger mutations, hence N=p+d. Suppose we sample n times from the urn, uniformly and with replacement, one mutation at a time, and let E(n) denote the expected number of distinct driver mutations drawn at least once. We aim to provide an expression for this function.

There are two trivial base cases: E(0)=0 and E(1)=d/N. For the general case n>1 observe that we can exploit the following recurrence:

\[E(n) = E(n-1) + \frac{1}{N}\cdot (d - E(n-1)) = \frac{N-1}{N}\cdot E(n-1) + \frac{d}{N}\]

i.e., the expectation at step n is the expectation at step n-1 plus the probability of drawing a driver mutation that remained unobserved after step n-1: in expectation, d - E(n-1) drivers are still unobserved, and each is drawn with probability 1/N. Letting a=(N-1)/N, this recurrence yields the following expression for E(n):

\[E(n) = \frac{d}{N}\cdot\frac{1-a^{n}}{1-a} = d\cdot\left[1 - \left(\frac{N-1}{N}\right)^n \right]\]
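
As a sanity check on the step from the recurrence to the closed form, the following minimal sketch iterates the recurrence from E(0)=0 and compares it with the closed-form expression. The function names and the urn sizes (N=50, d=10) are illustrative assumptions, not values from our analysis.

```python
def expected_drivers_recurrence(n, d, N):
    """Iterate E(k) = E(k-1) + (d - E(k-1)) / N, starting from E(0) = 0."""
    e = 0.0
    for _ in range(n):
        e += (d - e) / N
    return e


def expected_drivers_closed_form(n, d, N):
    """Closed form E(n) = d * (1 - ((N - 1) / N) ** n)."""
    return d * (1.0 - ((N - 1) / N) ** n)


# Hypothetical urn: 50 mutations, 10 of which are drivers.
d, N = 10, 50
for n in (0, 1, 10, 100, 1000):
    rec = expected_drivers_recurrence(n, d, N)
    closed = expected_drivers_closed_form(n, d, N)
    # The two columns should agree up to floating-point rounding.
    print(f"n={n:5d}  recurrence={rec:.6f}  closed form={closed:.6f}")
```

Both columns should coincide up to rounding, and both should approach d as n grows.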

Suppose now that the probability \(\delta\) of drawing any particular driver is the same for all drivers, but not necessarily equal to 1/N. We can build a similar recurrence:

\[E(n) = E(n-1) + \delta\cdot (d - E(n-1)) = (1-\delta)\cdot E(n-1) + \delta\cdot d\]

from which the following expression follows if we let \(a=1-\delta\):

\[E(n) = \delta\cdot d\cdot\frac{1-a^{n}}{1-a} = d\cdot\left[1 - (1-\delta)^n \right].\]
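
The expression above can also be checked empirically. The sketch below is a small Monte Carlo simulation under the stated assumption that each of the d drivers is drawn with the same probability \(\delta\) at every step (so \(d\cdot\delta\le 1\)), with the remaining probability mass going to passengers. The parameter values, the number of trials and the seed are arbitrary choices for illustration.

```python
import random


def expected_drivers(n, d, delta):
    """Closed form E(n) = d * (1 - (1 - delta) ** n)."""
    return d * (1.0 - (1.0 - delta) ** n)


def simulate_distinct_drivers(n, d, delta, trials=2000, seed=0):
    """Average number of distinct drivers seen in n draws, over many trials.

    Each draw hits driver i (0 <= i < d) with probability delta and a
    passenger with probability 1 - d * delta.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        for _ in range(n):
            u = rng.random()
            if u < d * delta:
                # u falls in the driver region; slice it into d equal parts.
                seen.add(min(int(u / delta), d - 1))
        total += len(seen)
    return total / trials


# Hypothetical setting: 10 drivers, each drawn with probability 0.02.
n, d, delta = 50, 10, 0.02
print("simulated:", simulate_distinct_drivers(n, d, delta))
print("closed form:", expected_drivers(n, d, delta))
```

With enough trials the simulated average should fall close to the closed-form value; the residual gap is sampling noise.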

For a small enough \(\delta\), we have \(\log(1-\delta)\approx-\delta\), so the following approximation holds:

\[E(n)\approx d\cdot[1-\exp(-\delta\cdot n)]\]
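
To give a feel for how small \(\delta\) needs to be, the following sketch compares the exact expression with its exponential approximation at n = 1/\(\delta\) draws, for a few values of \(\delta\). The chosen values and function names are illustrative assumptions only.

```python
import math


def exact(n, d, delta):
    """Exact expectation d * (1 - (1 - delta) ** n)."""
    return d * (1.0 - (1.0 - delta) ** n)


def approx(n, d, delta):
    """Exponential approximation d * (1 - exp(-delta * n))."""
    return d * (1.0 - math.exp(-delta * n))


def gap(d, delta):
    """Absolute difference between exact and approximate E(n) at n = 1/delta."""
    n = round(1.0 / delta)
    return abs(exact(n, d, delta) - approx(n, d, delta))


d = 10  # hypothetical number of drivers
for delta in (0.2, 0.02, 0.002):
    print(f"delta={delta:<6} gap at n=1/delta: {gap(d, delta):.5f}")
```

The gap should shrink as \(\delta\) decreases, consistent with the approximation being accurate only for small \(\delta\).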

All in all, both of the previous expressions can be put into exponential form:

\[E(n) \approx d\cdot [1 - \exp(\beta\cdot n)]\]

where \(\beta\) can represent either \(\log((N-1)/N)\) or \(\log(1-\delta)\), depending on the case; in both cases \(\beta\) is negative.

This is the expression we used to define an index measuring the extent to which more driver mutations remain to be discovered as more tumors are sequenced. Intuitively, d is the asymptotic value that E(n) approaches for large n, i.e., the total number of driver mutations to be discovered in the urn. Notice also that the smaller (i.e., more negative) \(\beta\) is, the faster the exponential term approaches zero, i.e., the fewer samples are required to approach the asymptotic value.
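
The role of \(\beta\) can be made concrete by solving \(1-\exp(\beta\cdot n)\ge f\) for n, which gives \(n\ge\log(1-f)/\beta\) for a target fraction f of the asymptote d. The sketch below is an illustration under that algebra only; the helper name `samples_to_reach` and the example values of \(\beta\) are hypothetical and do not reproduce the fitting procedure used for the discovery index.

```python
import math


def samples_to_reach(fraction, beta):
    """Smallest integer n with 1 - exp(beta * n) >= fraction.

    Assumes beta < 0 and 0 < fraction < 1.
    """
    if not (beta < 0 and 0 < fraction < 1):
        raise ValueError("need beta < 0 and 0 < fraction < 1")
    return math.ceil(math.log(1.0 - fraction) / beta)


# Two hypothetical urns: N = 50 and N = 10 mutations, uniform sampling.
for N in (50, 10):
    beta = math.log((N - 1) / N)
    n90 = samples_to_reach(0.9, beta)
    print(f"N={N:3d}  beta={beta:.4f}  samples to reach 90% of d: {n90}")
```

The urn with the more negative \(\beta\) should require fewer samples to reach the same fraction of d.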