Validation

We present a series of analyses to validate the in silico saturation mutagenesis approach implemented via boostDM models. We compare the performance of boostDM models in the classification of observed and synthetic mutations in 5 cancer genes with that of experimental saturation mutagenesis assays on the same genes. The comparison is extended, across an important fraction of boostDM models, to the performance of 8 bioinformatics methods aimed at identifying driver mutations or at assessing the functional impact of variants. Furthermore, we validate the predictions of the method using sets of observed and experimentally validated rare variants that are held out while re-training the models: we assess the ability of models trained and tested on a subset of the original data to classify the remaining held-out mutations, which constitute, in effect, an independent validation dataset. We also assess the ability of the method to reflect pathogenicity and oncogenicity in curated datasets with clinically relevant variants. Finally, we test the enrichment of positive predictions among lower-frequency polymorphisms. Collectively, these tests demonstrate the satisfactory performance of boostDM models, which systematically outperform saturation mutagenesis experiments and other bioinformatics tools. In this document, we also describe the dependencies between the input data and the performance and complexity of the resulting models using random subsampling tests. The results of this experiment demonstrate that as more cancer sequencing data become available, more good-quality models will be attainable.

Hold-out with experimentally validated rare oncogenic variants

Rationale

We wanted to analyze the ability of the approach employed to train and test boostDM models to infer rules that can be subsequently applied to mutations never used during training, i.e. the ability of the method to produce rules that can be extrapolated to biologically plausible, yet unobserved, mutations. This is particularly relevant for assessing low-frequency variants of unknown significance.

Methodology

Given a probing set of missense mutations, we re-trained boostDM models with these mutations held out, and then assessed the ability of the re-trained models to correctly classify them. We carried out this analysis separately with two datasets of validated rare mutations, reported in [Kim et al., 2016] and [Berger et al., 2016], respectively. These sets comprise mutations that have been both observed in tumors and experimentally validated.

Datasets of experimentally validated rare variants

Kim et al.

In [Kim et al., 2016] the authors validated, through in vivo tumor formation assays, a set of cancer mutations comprising both frequently and rarely observed mutations. Out of the positive set, 71 alleles belonging to the rarely observed category were validated by xenografts in NCR-Nu mice, as measured by the size of the tumors after 130 days. We used the “functional” (positive set) and “neutral” (negative set) labels as defined by the authors.

Berger et al.

In [Berger et al., 2016] the authors experimentally analyzed 194 mutations observed in lung adenocarcinomas, 69% of which were deemed impactful via expression-based variant impact phenotyping (eVIP). Resistance to EGFR inhibition by erlotinib at two different concentrations was also measured. We labeled a mutation as “driver” when an eVIP-positive mutation also elicited resistance to EGFR inhibition in both conditions, and as “passenger” when the allele was negative in both experiments.

Results

When computing the performance of the re-trained models on the pool of annotated missense mutations, we conducted separate analyses for oncogenes and tumor suppressors. When casting the predictions of re-trained models, only those mutations mapping to genes with a good-quality model were assessed (364 variants), using the most specific tumor type models available.

Predictive performance was generally high, with slightly better results for oncogenes:

Kim et al. missense mutations in oncogenes

96 driver, 16 passenger; \(F_{50}=0.95\).

Kim et al. missense mutations in tumor suppressors

47 driver, 55 passenger; \(F_{50}=0.72\).

Berger et al. missense mutations in oncogenes

131 driver, 10 passengers; \(F_{50}=0.92\).

Berger et al. missense mutations in tumor suppressors

0 drivers, 12 passengers; NPV=0.92 (the negative predictive value was used instead, given the absence of driver labels).

We also computed gene-wise performance for a reduced set of genes, namely those for which at least one mutation in each of the driver and passenger classes was reported. See Fig. 1h and Fig. S4a.
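Figures like the ones above can be reproduced from driver/passenger labels and predictions. A minimal sketch in Python, assuming \(F_{50}\) corresponds to the F-beta score with beta=0.5 (precision weighed over recall) as implemented in scikit-learn; the toy labels below are illustrative, not data from the study:

```python
import numpy as np
from sklearn.metrics import fbeta_score, confusion_matrix

def f50(y_true, y_pred):
    """F_50 score (assumption: F-beta with beta=0.5, favoring precision)."""
    return fbeta_score(y_true, y_pred, beta=0.5)

def npv(y_true, y_pred):
    """Negative predictive value: TN / (TN + FN)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fn)

# Toy example: 1 = driver, 0 = passenger
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 1])
print(f50(y_true, y_pred))
print(npv(y_true, y_pred))
```

The NPV helper is what the Berger et al. tumor-suppressor set above calls for, since \(F_{50}\) is undefined when no driver labels are present.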

Sample-wise hold-out validation

Rationale

We wanted to test boostDM models in a setting close to the real-life scenario of interpreting mutations in newly sequenced tumors. Such an experiment consists in leaving a fraction of the samples of a cohort of a tumor type out of those used in the training and testing of the models (i.e., held out) to use in their subsequent validation. Thus, we held out a random subset representing 10% of the samples from each cohort, the goal being to compare the performance of the re-trained models on the held-out data per gene-tumor type against the cross-validation performance of the original models trained with the full dataset. Specifically, holding out a subset of samples seeks to evaluate how the model is expected to perform in the classification of mutations identified in newly sequenced cancer samples. This is key for in silico saturation mutagenesis –and by extension, for the interpretation of mutations in newly sequenced cancer genomes– which by definition consists in the classification of yet unobserved mutations.

Methodology

Among the mutations held out, we kept those that were included in the pool of mutations used for training of the full models, with their respective labels. We then filtered out those mutations not mapping to any of the 185 gene-tumor types covered by good-quality re-trained gene models. Considering unique mutations, about 66% were annotated as drivers and the rest as passengers.

Matching the predictions of the re-trained models with the driver/passenger annotations, we could evaluate the hold-out performance (\(F_{50}\)) for 161 gene-tumor type pairs. We then compared these values with the respective cross-validation performance of the full models.

Results

We compare the hold-out performance with re-trained models of TP53 against the cross-validation assessment yielded by the respective full models (see Fig. S2c). We also show the hold-out vs cross-validation across gene-tumor type pairs for oncogenes and tumor suppressors, separately.

The case of TP53 is paradigmatic of a driver gene with a moderate discovery index, implying a relatively low number of recurrent mutations in sample subsets; yet it has been intensely covered by sequencing and is frequently affected by driver mutations across many tumor types. This makes it the gene with the most good-quality models available for training and the most stable under perturbations of the input datasets. Consequently, the observed mutations in TP53 are highly representative across several tumor types. The comparison between our 10% sample-wise hold-out and cross-validation in TP53 is consistent with the view that cross-validation yields a reasonable proxy for the expected performance on yet unobserved sets of mutations, with comparable performance overall.

Looking across gene-tumor types, the results suggest that cross-validation is a conservative proxy of the sample-wise hold-out validation. This can be explained by the fact that the cross-validation assessment reflects the typical performance of base classifiers prior to aggregation, which are expected to be sub-optimal models reflecting only partial classification rules. A segregated analysis within oncogenes and tumor suppressors indicates that cross-validation may be slightly more conservative in the case of oncogenes (Fig. S2c).

Comparison against experimental and in silico functional scores

Rationale

Our aim in the development of boostDM is the site-by-site identification of potential driver mutations. We have framed it as a supervised classification problem that takes observed mutations in consequence types with high excess as positive examples. Observed mutations in these high-excess cancer genes are the best proxy for driver mutations to use as a validation dataset. In this regard, tumors may be considered the only true real-life validation –a natural experiment– of oncogenic mutations. Experimental approaches in general assess the effect of mutations in artificial experimental settings and do not directly measure oncogenic potential in human tissues. This is why we have used the observed –and randomized– mutations across tumors as positive and negative sets for the validation of saturation mutagenesis approaches, both experimental and in silico (via boostDM models or via other bioinformatics tools aimed at the identification of driver mutations or at the assessment of the functional impact of variants).

Methodology

Bearing this in mind, the boostDM performance we report to compare with other approaches is derived by keeping as ground truth all the test mutations used for the cross-validation of boostDM (described above) and measuring the agreement of the predictions cast by each base classifier on its corresponding set of test mutations. As for the other methods (experimental mutagenesis or in silico functional scores), we use the same set of ground-truth mutations, keeping only those for which the corresponding method has a defined functional score.

We assessed the performance of a collection of mutation-specific functional scoring methods, comprising experimental saturation mutagenesis assays (TP53, RAS domain and PTEN) and in silico functional scores, then compared them against boostDM performance.

First we introduce the methodology; then, for the sake of interpretation, we provide a brief description of the methods themselves.

Comparison criterion

Although boostDM establishes a natural cutoff (0.5) to tell predicted drivers apart from passengers, none of the methods we compare with allows a straightforward dichotomous classification. Hence we evaluate their performance as the area under the precision-recall curve (auPRC) when comparing the method's output against a driver/passenger-annotated validation dataset.

Validation Dataset

For a specific gene-tumor type context we defined the validation dataset as the balanced set comprising all test mutation instances across the 50 base classifier splits. This choice allows an unbiased performance estimation of boostDM, as we can evaluate the performance of each base classifier on its respective test mutations, which were never used for training. In other words, for each test set included in the validation pool, we obtained the boostDM forecasts with the base classifier resulting from the training set of the same split.

The other scores were mapped using genomic coordinates –and the tumor type of the model, in the case of the per-tumor-type CHASMplus models. Note that this approach may in principle overestimate the performance of the supervised learning-based bioinformatics tools we evaluate, as we do not prevent mutations that were used in their training from being part of the validation dataset. Finally, each method was evaluated only on the set of validation mutations where the score could be mapped.

Precision-recall curves

For each score we computed the precision-recall curve from the validation dataset using the Python sklearn.metrics.precision_recall_curve function, which required a re-scaling step to render all the scores in a 0-1 probability range. We accomplished this with a logistic regression step that used the score as a single covariate and the labels of the validation set as response. For a fair comparison, the same step was conducted for all methods, including boostDM.
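The re-scaling and curve computation described above can be sketched as follows; the simulated scores and labels, and the use of scikit-learn's default LogisticRegression settings, are illustrative assumptions rather than the study's exact pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)

# Hypothetical raw scores on an arbitrary scale (e.g. an experimental
# functional score) and driver/passenger labels of a validation set.
labels = rng.integers(0, 2, size=200)
scores = labels * 2.0 + rng.normal(0, 1.5, size=200)  # higher = more driver-like

# Re-scale the raw score into a 0-1 probability with a one-covariate
# logistic regression, using the validation labels as response.
lr = LogisticRegression().fit(scores.reshape(-1, 1), labels)
probs = lr.predict_proba(scores.reshape(-1, 1))[:, 1]

# Precision-recall curve and the area under it (auPRC).
precision, recall, _ = precision_recall_curve(labels, probs)
auprc = auc(recall, precision)
print(auprc)
```

Since the logistic transform is monotone, it leaves the ranking of the mutations unchanged; its role here is to bring heterogeneous score scales into a common probability range.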

Continuous scores from a dichotomous perspective

For the sake of completeness, in the case of the TP53 (Kato et al. [Kato et al., 2003] and Giacomelli et al. [Giacomelli et al., 2018]) and PTEN (Mighell et al. [Mighell et al., 2018]) experimental saturation mutagenesis assays, CHASMplus and CADD, we also conducted the performance comparison against boostDM by using probing cutoffs that yielded a dichotomous forecast (\(F_{50}\)).

For TP53 (Kato et al.) we used a cutoff of 50 (a.u.) for the median intensities of the WAF1, MDM2, BAX, h1433s, AIP1, GADD45, NOXA and P53R2 reporters. For TP53 (Giacomelli et al.) we used a cutoff of 0 (A549 cells, p53NULL + etoposide Z-score), where values below 0 are positive. For PTEN (Mighell et al.) we used a cutoff of -2 (a.u.) for the log-scaled and wild-type-normalized fitness scores. For CHASMplus we created two dichotomous classifications based on mild and stringent cutoff levels. These cutoffs were decided upon transformation of the score p-values into q-values by FDR correction across all the CHASMplus-annotated mutations mapping to the same gene-tumor type pair: the stringent (resp. mild) cutoff required a q-value of 0 (resp. q-value \(\leq\) 0.01). For CADD we created a dichotomous classification based on a cutoff of 10. We present the results after conducting the comparison for the entire validation set and for each test set separately (Fig. S3).
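The q-value transformation for CHASMplus can be sketched with a Benjamini-Hochberg FDR correction; the p-values below are hypothetical, and the use of statsmodels' multipletests as the FDR procedure is an assumption for illustration:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical CHASMplus p-values for mutations mapping to the same
# gene-tumor type pair.
pvalues = np.array([0.0, 0.0001, 0.004, 0.02, 0.2, 0.6])

# Benjamini-Hochberg FDR correction yields q-values.
_, qvalues, _, _ = multipletests(pvalues, method="fdr_bh")

# Stringent cutoff: q-value = 0; mild cutoff: q-value <= 0.01.
stringent = qvalues == 0
mild = qvalues <= 0.01
print(stringent.sum(), mild.sum())
```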

Experimental saturation mutagenesis assays

We compared the results between boostDM and the predictions that would be derived from an experimental saturation mutagenesis assay. Thus we parsed and mapped the outputs of several experimental saturation mutagenesis studies for TP53 [Giacomelli et al., 2018, Kato et al., 2003], mutations in the RAS domain [Bandaru et al., 2017], and PTEN [Mighell et al., 2018], then analyzed the concordance between these scores and boostDM across all the tumor types where a boostDM model was available.

Datasets

Kato et al. [Kato et al., 2003]

TP53 data. In this experiment, site-directed mutagenesis was employed to construct all possible missense substitutions in the p53 protein (2569 nucleotide substitutions, 2314 different amino acid variants), followed by promoter-specific transcriptional activity measurements in yeast-based functional assays. Sites were considered non-functional when less than 50% of wild-type transactivation was observed, as measured by the median of the WAF1, MDM2, BAX, h1433s, AIP1, GADD45, NOXA and P53R2 reporters.
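The labeling rule can be sketched in a few lines; the transactivation values below are hypothetical, while the rule itself (median of the eight reporters below 50% of wild type) follows the description above:

```python
import pandas as pd

# Hypothetical transactivation values (% of wild type) for three p53
# variants across the eight promoter-specific reporters.
reporters = ["WAF1", "MDM2", "BAX", "h1433s", "AIP1", "GADD45", "NOXA", "P53R2"]
df = pd.DataFrame(
    [[120, 110, 95, 105, 100, 98, 102, 97],   # near wild-type activity
     [10, 5, 20, 15, 8, 12, 18, 9],           # strongly impaired
     [60, 45, 55, 40, 48, 52, 58, 44]],       # borderline (median exactly 50)
    columns=reporters, index=["V1", "V2", "V3"],
)

# A variant is labeled non-functional when the median transactivation
# across the eight reporters falls below 50% of wild type.
non_functional = df[reporters].median(axis=1) < 50
print(non_functional.tolist())
```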

Giacomelli et al. [Giacomelli et al., 2018]

TP53 data. TP53 saturation mutagenesis screens in an isogenic pair of TP53 wild-type and null cell lines, using a library comprising 8258 mutant TP53 alleles, such that each allele contains a single mutation and each of the 20 natural amino acids and a stop codon is represented at each codon position.

Bandaru et al. [Bandaru et al., 2017]

Ras function data. Analysis of a complete set of single-site mutations for the relevant variant of human H-Ras, expressed together with Raf-RBD, in a bacterial two-hybrid system.

Mighell et al. [Mighell et al., 2018]

PTEN data. Parallel approach in an artificial humanized yeast model to evaluate the impact of 7244 PTEN amino acid variants on lipid phosphatase activity. The final score is the mean of the authors' functional score across six experiments (two biological replicates with three technical replicates each).

Other comparisons

There are other experimental saturation mutagenesis studies that in principle would provide interesting comparisons to further validate and interpret our methodology. This is the case of a recent study on BRCA1 [Findlay et al., 2018]. However, for comparisons to render meaningful outcomes, we require our models to reach a minimum quality for the genes concerned. In Supp. Table 2 we provide specific information about the 2080 starting cancer gene-tissue combinations entering our pipeline. The numbers provided in this table (in particular the label “effective mutations”, i.e., mutations available for training once the excess for each consequence type across each cohort of the tumor type is accounted for) constitute the rationale behind the model-building flow diagram in Figure S1a. As of the preparation of this manuscript, specific models for some genes (e.g. BRCA1) still did not attain the minimum requirements: BRCA1 in Breast Adenocarcinoma attains only 7 observed mutations compatible with high excess that can be effectively used for training, an insufficient number to yield the 30 mutations required for the training sets. This precludes any comparison until new somatic mutation data become available for training.

In silico functional scores

CHASMplus [Tokheim and Karchin, 2019]

We ran CHASMplus with the following 15 tumor-type-specific models according to the TCGA ontology: Bladder urothelial carcinoma, Cervical squamous cell carcinoma and endocervical adenocarcinoma, Colorectal adenocarcinoma, Esophageal carcinoma, Glioblastoma, Acute myeloid leukemia, Liver hepatocellular carcinoma, Lung adenocarcinoma, Lung squamous cell carcinoma, Ovarian serous cystadenocarcinoma, Pancreatic adenocarcinoma, Prostate adenocarcinoma, Skin cutaneous melanoma, Thyroid carcinoma, Uterine corpus endometrial carcinoma.

CHASMplus scores missense mutations observed in pan-cancer and significantly mutated driver genes via a supervised learning approach employing a set of 95 features. CHASMplus generates models encompassing all cancer genes on a tissue-specific basis. For each of the 185 gene-tumor type pairs with good-quality models, we mapped the canonical-transcript-specific scores of CHASMplus onto the test mutations. Note that CHASMplus scores might in principle reflect the features of the test mutations employed in our testing setting, as they were not explicitly removed from its training set, which might lead to an overestimation of the method's performance.

We also downloaded from the dbNSFP repository (version 4.1, [Liu et al., 2020]) the precomputed scores for the following methods. Herein we briefly summarize the scope of each method:

CADD [Rentzsch et al., 2019]

Tool for scoring the deleteriousness of SNVs in the human genome, based on the distinction between simulated de novo variants and polymorphisms that have arisen and become fixed in human populations since the split between humans and chimpanzees.

SIFT and SIFT4G [Ng and Henikoff, 2003, Vaser et al., 2016]

Tools for scoring the deleteriousness of missense amino acid substitutions based on the position and type of the amino acid change. From protein alignment information, SIFT calculates the probability that an amino acid at a position is tolerated, conditional on the most frequent amino acid being tolerated.

Polyphen-2 (HDIV and HVAR) [Adzhubei et al., 2013]

Predicts the stability and functional impact of amino acid substitutions in human proteins by exploiting structural and comparative evolutionary information.

FATHMM [Shihab et al., 2013]

Predicts the potential deleteriousness of amino acid substitutions using a Hidden Markov Model-based probabilistic approach.

VEST4 [Carter et al., 2013]

Prioritizes likely pathogenic missense variants with a supervised learning based on the distinction between curated disease mutations and high allele frequency putatively neutral missense variants.

MutationAssessor [Reva et al., 2011]

Introduces a functional impact score for amino acid substitutions by exploiting evolutionary conservation patterns. These patterns are derived from aligned families and subfamilies of sequence homologs within and between species using a combinatorial entropy formalism.

Classification of manually curated pathogenic variants

We also used boostDM models to classify somatic pathogenic and germline benign variants collected across the literature by ClinVar [Landrum et al., 2014], a database that annotates human variation with phenotypic consequences. Although these mutations are biased towards those that have been observed more recurrently, and their annotation as oncogenic does not necessarily hold across tumor types, we use them as a further validation exercise for boostDM models.

For each mutation in this dataset, boostDM predictions were cast using all the tumor type models corresponding to genes for which a sufficiently high-quality gene-tumor type model was available. We measured the precision, recall and \(F_{50}\) based on the ability of boostDM to correctly separate somatic pathogenic (ClinVar terms pathogenic and likely pathogenic) from germline benign (ClinVar terms benign and likely benign) variants.

Enrichment in low-frequency polymorphisms

We reasoned that applying boostDM models to germline variants detected across human populations should identify potential driver mutations depleted among frequent variants, since these have passed purifying selection. We downloaded all SNPs from gnomAD (release 2.1.1, [Karczewski et al., 2020]) and mapped them to genes with a good-quality boostDM model. We computed the driver/passenger boostDM class and tested the association between the boostDM class (response) and the log allele frequency of the SNPs with a logistic regression. For each gene-tumor type pair with mapped SNPs we computed the p-value and effect size (log-odds ratio) representing the propensity of lower-frequency variants to be more enriched in potential driver mutations according to boostDM.