BoostDM overview

Training set

boostDM is based on a supervised learning approach using a training set of driver and passenger mutations in cancer genes. Defining such a training set is not trivial, as there is no complete ground-truth collection of driver and passenger point mutations for any cancer gene.

For some genes, the excess of observed-over-expected mutations is large enough that the vast majority of observed mutations are involved in tumorigenesis. We propose (and validate) that cancer driver genes with enough observed mutations of a given consequence type above a certain excess provide both sufficiently many mutations and a high enough driver enrichment to render good discriminative ability with standard supervised classification techniques. We thus define as the positive set (drivers) the mutations with an observed-to-expected excess higher than 85% according to dNdScv [Martincorena et al., 2017].
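As a worked illustration of the 85% threshold: for a dN/dS-style observed-to-expected ratio omega, the excess is the fraction of observed mutations above the neutral expectation, i.e. (observed − expected)/observed = 1 − 1/omega. The sketch below is illustrative only (the function name and the example omega value are assumptions, not boostDM code):

```python
def excess(omega: float) -> float:
    """Fraction of observed mutations in excess of the neutral expectation,
    given an observed-to-expected ratio omega (as estimated by dNdScv):
    excess = (observed - expected) / observed = 1 - 1/omega."""
    if omega <= 0:
        raise ValueError("omega must be positive")
    return max(0.0, 1.0 - 1.0 / omega)

# A consequence type with omega = 8 has an excess of 0.875,
# above the 85% threshold used to define the positive set:
print(excess(8.0))
```

Note that the excess grows quickly with omega: any consequence type with omega above roughly 6.7 clears the 85% cutoff.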

From a theoretical perspective, passenger mutations would be any other mutations. For discriminative efficiency, however, the training data must reflect the fact that passenger mutations are randomly generated following tri-nucleotide-specific mutation rates (the neutral mutational profile). Thus a collection of mutations simulated following the neutral mutational profile is used as the negative set (passengers).
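Simulating the negative set amounts to sampling substitutions in proportion to the tri-nucleotide-specific rates. The following sketch uses a toy profile with only four of the 96 tri-nucleotide channels (the channel names and weights are made-up assumptions, not the actual profile):

```python
import random

# Hypothetical neutral mutational profile: probability of each
# tri-nucleotide substitution channel (toy subset of the 96 channels).
profile = {
    ("ACA", "A>T"): 0.05,
    ("ACG", "C>T"): 0.40,   # e.g. elevated C>T at CpG sites
    ("TCA", "C>A"): 0.25,
    ("GCG", "C>T"): 0.30,
}

def simulate_passengers(n, profile, seed=0):
    """Draw n simulated passenger mutations following tri-nucleotide-specific
    mutation rates (a sketch, not boostDM's actual simulation code)."""
    rng = random.Random(seed)
    channels = list(profile)
    weights = [profile[c] for c in channels]
    return rng.choices(channels, weights=weights, k=n)

negatives = simulate_passengers(1000, profile)
```

In practice each simulated channel would then be mapped to concrete genomic positions within the gene that match the tri-nucleotide context.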

Features

Each mutation provided for training is annotated with a vector of mutational features, which the classification task exploits to discriminate between observed drivers and passengers in tumours. Some mutational features of each cancer gene across malignancies have been derived from the systematic analysis of tens of thousands of tumor samples by IntOGen [Martínez-Jiménez et al., 2020]. Other relevant features have been collected from public databases: VEP.92 [McLaren et al., 2016], PhosphositePlus [Hornbeck et al., 2015] and phyloP [Pollard et al., 2010].

Learning heuristics

For each gene-tumor type pair, the learning heuristic uses gradient boosted trees as the base classifier. Several base classifiers are trained on subsets (bagging) of the pool of learning mutations, and these classifiers are then aggregated into a consensus model. Alongside training, this approach yields an out-of-bag (cross-validation) assessment of model performance. This evaluation strategy is expected to give a conservative estimate of performance, as it reflects the typical performance of the base classifiers before aggregation.
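The bagging-plus-consensus scheme can be sketched as follows. This is a minimal illustration on synthetic data with assumed hyperparameters (number of base classifiers, scikit-learn defaults), not the boostDM training pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Toy stand-in for the pool of labelled mutations (drivers vs passengers).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

rng = np.random.default_rng(0)
models, oob_scores = [], []
for _ in range(10):                    # number of base classifiers (assumed)
    idx = rng.choice(len(X), size=len(X), replace=True)   # bagging subset
    oob = np.setdiff1d(np.arange(len(X)), idx)            # out-of-bag samples
    clf = GradientBoostingClassifier(random_state=0).fit(X[idx], y[idx])
    # Out-of-bag evaluation of this base classifier (before aggregation).
    oob_scores.append(roc_auc_score(y[oob], clf.predict_proba(X[oob])[:, 1]))
    models.append(clf)

def consensus_score(models, X):
    """Consensus model: average the base classifiers' scores."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

print(np.mean(oob_scores))   # conservative estimate of model performance
```

Averaging the out-of-bag scores of the individual base classifiers, rather than evaluating the consensus, is what makes the estimate conservative.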

Scope

As a principle, boostDM provides a distinct model for each driver gene-tumor type pair. In some cases, however, a model cannot be successfully rolled out, either because there are not enough mutations for training or because, even when there are, cross-validation yields low accuracy. Within each gene, boostDM covers the protein-coding sequence.

The effects of all mutations considered are relative to the canonical transcript of protein-coding genes according to the Ensembl Variant Effect Predictor version 92 (VEP.92 [McLaren et al., 2016]). Note that these transcripts may include mutations in untranslated regions that affect splicing, even though they are not protein-altering mutations.

Prediction

For each cancer gene and tumor type, our method fits a model representing the feature rules that define driver mutations in that context. Specifically, the method yields a score \(0\leq p \leq 1\) that reflects the strength of the forecast that the mutation is a potential driver: the higher the score, the stronger the evidence. Although p is not calibrated to support a probabilistic interpretation, a score >0.5 reflects predominant evidence in favour of the mutation being a potential driver.

Explanation

Each base classifier (gradient boosted trees) admits an additive explanation model that decomposes the logit prediction for each individual mutation into a sum of so-called SHAP values associated with the features [Lundberg and Lee, 2017]. These are explanatory values in the sense that a feature with a positive (resp. negative) SHAP value implies that the method deems it more likely (resp. less likely) that the mutation is a driver, conditioned on the feature's value (see Shapley Additive Explanations). We define the SHAP values of the aggregated model as the mean SHAP values across base classifiers.
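The aggregation of explanations can be illustrated with toy numbers. In a real pipeline the per-model attributions would come from e.g. shap.TreeExplainer applied to each gradient-boosted model; here the SHAP values, feature names, and base value are all assumed for illustration:

```python
import numpy as np

# Toy per-feature SHAP values (in logit space) for one mutation,
# one row per base classifier; numbers and features are made up.
shap_per_model = np.array([
    # csqn   phylop  ptm    cluster
    [ 1.2,   0.6,   -0.1,   0.9],   # base classifier 1
    [ 1.0,   0.8,   -0.2,   0.7],   # base classifier 2
    [ 1.4,   0.5,    0.0,   1.1],   # base classifier 3
])

# Aggregated explanation: mean SHAP value per feature across classifiers.
mean_shap = shap_per_model.mean(axis=0)

# Additivity: each model's logit is its base value plus its SHAP values.
base_value = -2.0                       # assumed expected logit
logits = base_value + shap_per_model.sum(axis=1)
```

Because the decomposition is additive, averaging SHAP values across base classifiers yields an explanation that is itself additive with respect to the mean logit.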