BoostDM in detail

Implementation

Each boostDM model is an ensemble of base classifiers (n=50), each trained with a random sample subset (bagging) drawn from the learning dataset. Each base classifier is a boosted trees model (a sum of regression tree functions trained with the XGBoost gradient boosting implementation) with a cross-entropy loss function. The base classifiers are aggregated into a consensus model with an aggregator intended to correct for the systematic bias of each expert classifier (see 'Aggregator of base classifiers'). The additive explanations of the consensus model are computed as the mean of the SHAP values across base classifiers (see 'Shapley Additive Explanations').

The boostDM pipeline has been implemented in Python and Nextflow [Di Tommaso et al., 2017]. The models were implemented with the libraries xgboost [Chen and Guestrin, 2016] (version 0.90) to train the base classifiers and shap [Lundberg and Lee, 2017] (version 0.28.5) to compute the SHAP values associated with the predictions.

Base classifier hyperparameters

The model hyperparameters specify the learning strategy to keep a balance between loss minimization and generalization. While some hyperparameter values are the result of an explicit design decision, others cannot be unequivocally defined. The current values stand as a compromise resulting from a series of tests. For more information about hyperparameter tuning with XGBoost, please check the xgboost documentation at xgboost.readthedocs.io.

Herein we provide the current hyperparameter set-up:

  • Models are sums of regression tree functions (booster = "gbtree").

  • The learning task minimizes a cross-entropy loss function (objective = "binary:logistic").

  • All the features are available when building each new tree (colsample_bytree = 1) and each new tree level (colsample_bylevel = 1).

  • The learning rate is 0.001 (learning_rate = 0.001), a value decided on by testing in combination with the maximum number of training steps.

  • Percentage of samples randomly drawn prior to growing a new tree is 70% at every iteration (subsample=0.7). This is a technical choice to further prevent overfitting.

  • The maximum depth of the trees used in the tree function is 4 (max_depth=4). In practice, good performance can be attained even at depths as low as 1 (stumps). However, at a higher but still tolerable computational cost, more depth also gives more room for learning complex dependencies.

  • Default regularization hyperparameters were employed.

  • Maximum number of training steps is 20,000 (n_estimators = 20000).

Empirical evidence suggests that this set-up is a convenient choice in view of performance tests, the expected capacity of the models to handle feature interactions and a moderate training speed. Notwithstanding the importance of this technical setup, because the base classifiers undergo additional aggregation, biases attributable to gradient boosting training tend to be moderated, resulting in an implicit form of regularization.
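
As an illustration, a minimal sketch of how a base classifier with the hyperparameters above could be instantiated through the xgboost scikit-learn wrapper; the variable name is ours and the snippet is not taken from the boostDM code base:

    from xgboost import XGBClassifier

    # Sketch: one base classifier configured with the hyperparameters listed above.
    base_classifier = XGBClassifier(
        booster="gbtree",             # sums of regression tree functions
        objective="binary:logistic",  # cross-entropy loss
        colsample_bytree=1,           # all features available for each new tree
        colsample_bylevel=1,          # all features available for each new tree level
        learning_rate=0.001,
        subsample=0.7,                # 70% of samples drawn before growing each tree
        max_depth=4,
        n_estimators=20000,           # maximum number of training steps
    )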

Oncotree: tumor type ontology

A specific tumor type determines the set of sequenced samples that contribute positive mutations to the learning dataset, the neutral mutation profile required to simulate passenger mutations and the value of some features.

We aim to develop models that can classify mutations across tumor types with different degrees of generality. The main reason is that for some specific gene-tumor type pairs the number of mutations is not sufficient to render a good model fit; however, the lack of mutations to train a gene-tumor type model can potentially be mitigated by pooling mutations from samples of tumor types that are similar from a histopathological perspective.

We delineate a model selection strategy based on a hierarchical organization of tumor types that allows us to conduct this pooling in a systematic way. Thus we defined a tumor type ontology (Oncotree) adapted from IntOGen [Martínez-Jiménez et al., 2020] that allows us to group samples according to several degrees of specificity regarding the tissue-pathology context where the mutations are reported. The root term CANCER is connected to two children terms, SOLID and NON_SOLID, which in turn connect to children terms at increasing levels of specificity. The leaves of this hierarchy define the most specific tumor-type terms considered in this study.

Each model trained by boostDM is relative to a gene and tumor-type pair, where the tumor type corresponds to a term in the Oncotree: we denote it as a (G,T)-model for brevity.

Filtering

Consequence type

We restricted our analysis to single nucleotide substitutions annotated with any of the following Sequence Ontology [SO] terms (according to VEP.92) with respect to the canonical transcript: splice-donor-variant, splice-acceptor-variant, splice-region-variant; missense-variant; stop-gained; stop-lost; synonymous-variant.

Multiple nucleotide variants

Adjacent point mutations were excluded from our analysis due to the plausible risk that these are misannotated.

De-duplication of samples

We removed duplicated samples (i.e., multiple samples from the same donor) for the training of models. If a cohort included duplicated samples, we selected the first sample in alphabetical order, as in [Martínez-Jiménez et al., 2020]. For additional information about the pre-processing of samples, the reader can check https://intogen.readthedocs.io.

Mutational features

Given a point mutation in a driver gene and tumor type (either observed or randomized) our method requires an annotation of the mutation with a set of features to train the classifiers.

Consequence type

Every mutation was annotated with the following binary features, matching the corresponding Sequence Ontology annotations: missense, nonsense (stop-gained; stop-lost), synonymous and splicing. For all observed and synthetic mutations in driver genes, we retrieved the annotations from VEP.92 for the canonical transcript.

Clusters of mutations in the DNA primary structure (linear clusters)

For every mutation we annotated whether it overlaps a significant linear cluster identified by the method OncodriveCLUSTL [Arnedo-Pac et al., 2019]. We created two annotation tiers for mutations overlapping i) linear clusters found in a cohort of the corresponding tumor type (tumor type specific: cat_1), or ii) clusters only identified in cohorts belonging to other tumor types (pan-cancer: cat_2). Additionally, we created another feature that represents the OncodriveCLUSTL score of the linear cluster in the tumor type (i.e., cat_1).

Clusters of mutations on the protein 3D structure (3D clusters)

For every mutation we annotated whether it overlaps a significant 3D cluster identified by the method HotMAPS [Tokheim et al., 2016] in a tumor type specific (cat_1) and pan-cancer (cat_2) way. For details of the HotMAPS implementation please refer to [Martínez-Jiménez et al., 2020].

Overlap with Pfam domains

For every mutation we annotated whether it overlaps a Pfam domain [El-Gebali et al., 2019] that is significantly enriched for mutations in the gene across tumors of the cancer type, as identified by the method smRegions [Martínez-Jiménez et al., 2020], in a tumor type specific (cat_1) and pan-cancer (cat_2) way.

Phylogenetic conservation

Conservation of the reference nucleotide across mammals measured through the PhyloP 100-way score [Pollard et al., 2010].

Post-translational modifications

Non-synonymous coding mutations were also annotated with post-translational modifications (PTMs) when the mutation affected an amino acid that is known to be acetylated, phosphorylated, ubiquitinated, methylated or subject to any other regulatory modification according to PhosphoSitePlus [Hornbeck et al., 2015].

NMD

Whether nonsense mutations overlap the last coding exon of the canonical transcript (according to VEP.92), reflecting the potential for the truncating variant to escape nonsense-mediated RNA decay [Lindeboom et al., 2016].

Two of the features (PhyloP and the OncodriveCLUSTL score) were encoded as their original floating-point values. The rest of the features (categorical) were one-hot encoded.
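
For illustration, a minimal sketch of how a single mutation could be represented under this encoding; the feature names and values below are hypothetical examples, not the exact identifiers used by boostDM:

    # Sketch: one encoded mutation. Categorical annotations are one-hot encoded;
    # PhyloP and the OncodriveCLUSTL score are kept as floating-point values.
    mutation_features = {
        "csqn_missense": 1, "csqn_nonsense": 0, "csqn_synonymous": 0, "csqn_splicing": 0,
        "CLUSTL_cat_1": 1, "CLUSTL_cat_2": 0,        # linear clusters
        "HotMAPS_cat_1": 0, "HotMAPS_cat_2": 0,      # 3D clusters
        "smRegions_cat_1": 1, "smRegions_cat_2": 0,  # enriched Pfam domains
        "PTM": 0, "NMD": 0,
        "CLUSTL_score": 0.37,                        # float
        "PhyloP": 2.81,                              # float
    }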

Excess of mutations

The so-called excess of mutations for a given coding consequence-type quantifies the proportion of observed mutations at this consequence-type that are not explained by the neutral mutation rate. The excess is inferred from the dN/dS estimate \(\omega\) as \((\omega - 1)\ /\ \omega\). We computed the excess for missense, nonsense and splicing-affecting mutations.
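
As a worked example of this formula (a minimal sketch, with the function name ours):

    def excess(omega):
        """Proportion of observed mutations of a consequence type not explained
        by the neutral mutation rate, derived from the dN/dS estimate omega."""
        return (omega - 1) / omega

    excess(6.67)  # ~0.85, i.e., the 85% threshold used below corresponds to omega ~ 6.67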

Training

Training datasets

The first requirement for our supervised learning approach is to create a catalogue of mutations labeled as Drivers or Passengers. This catalogue is established globally for all the models, then for the training of each (G,T)-model only the mutations relevant to the (G,T) context are used.

Drivers

The set of Drivers used for training are observed mutations in mutational cancer genes (IntOGen) that exhibit a consequence-type specific excess of observed-over-expected mutations higher than 85% according to the method dNdScv [Martincorena et al., 2017]. Repeated mutations (i.e., when the same mutation is observed in different samples) are allowed in the set of Drivers.

Passengers

For each Driver mutation mapping to a gene and cohort of samples, we randomly select one mutation from the CDS (VEP.92 canonical transcript) following the tri-nucleotide-specific probabilities recorded in the cohort (see Methods). Because we are bound to train 50 base classifiers, we do this random selection 50 times with replacement.
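
A minimal sketch of this sampling step, under the assumption that the candidate CDS mutations and their tri-nucleotide-based weights are available; all names and the toy data are hypothetical:

    import numpy as np

    rng = np.random.RandomState(0)

    # Toy data: candidate CDS mutations and their (renormalized) tri-nucleotide weights.
    cds_mutations = ["g.7675088C>T", "g.7674220G>A", "g.7673803G>A"]
    trinuc_probs = np.array([0.5, 0.3, 0.2])

    def sample_passengers(n_drivers):
        """Draw one synthetic Passenger per observed Driver, with replacement."""
        idx = rng.choice(len(cds_mutations), size=n_drivers, replace=True, p=trinuc_probs)
        return [cds_mutations[i] for i in idx]

    # The selection is repeated 50 times, once per base classifier.
    passenger_sets = [sample_passengers(n_drivers=30) for _ in range(50)]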

Data Splits

For each gene-tumor type pair (G,T), the corresponding base classifiers are obtained after conducting training with two sets of annotated mutations, namely Train and Test. We will refer to a Train-Test pair as a Split. In our setting each Split is generated randomly and must satisfy the following requirements:

  • Both Train and Test sets are Driver-Passenger balanced.

  • The Train set comprises 70% of Driver examples.

  • Each unique Driver instance belongs either to Train or Test.

  • Repeated Driver mutations (i.e., when the same mutation is observed in different samples) are allowed in Train, but not in Test (to prevent spurious inflation of the cross-validation performance evaluation).

  • Passenger mutations in Train and Test are randomly selected among all the mutations in the Passengers pool matching the (G,T) context.

We set at 30 the minimum average number of mutation instances in the Train set required to deem a model trainable.
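
A minimal sketch of how one Split satisfying these requirements could be generated; the helper and its arguments are hypothetical and do not reproduce the exact boostDM implementation:

    import random

    def make_split(driver_mutations, passenger_pool, train_fraction=0.7, seed=0):
        """Sketch of one Train-Test Split. driver_mutations may contain repeated
        entries (same mutation observed in several samples); passenger_pool holds
        synthetic Passengers matching the (G,T) context."""
        rng = random.Random(seed)
        unique_drivers = sorted(set(driver_mutations))
        rng.shuffle(unique_drivers)
        n_train = round(train_fraction * len(unique_drivers))
        train_unique = set(unique_drivers[:n_train])

        # Repeated Driver mutations are allowed in Train but not in Test.
        train_drivers = [m for m in driver_mutations if m in train_unique]
        test_drivers = unique_drivers[n_train:]

        # Driver-Passenger balance: as many Passengers as Drivers in each set.
        train_passengers = rng.sample(passenger_pool, len(train_drivers))
        test_passengers = rng.sample(passenger_pool, len(test_drivers))
        return (train_drivers, train_passengers), (test_drivers, test_passengers)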

Cross-validation and early stopping

When training a base classifier, we implemented sequential evaluation to determine the optimal number of estimators. We evaluate the performance after each learning step using the Test dataset. This sequential evaluation informs whether the training must stop due to steady or decreasing performance for a set of consecutive iterations (early stopping).

We use the cross-entropy (see scikit-learn.org/model_evaluation) as a performance measure to assess training progress sequentially with cross validation.

Given true labels \({\bf y} = \{y_i\}\) and forecasts \({\bf \hat y}=\{\hat{y}_i\}\), the cross-entropy is defined as:

\[J({\bf y}, {\bf \hat y}) = -\frac{1}{N}\sum_{i=1}^N \left[ y_i\log\hat y_i + (1-y_i)\log(1-\hat y_i)\right].\]

We define an early-stopping requirement of 2,000 iterations. Thus each expert classifier training finishes when either of the following conditions is first met: i) the maximum number of training steps (N=20,000 iterations) is attained; ii) the model’s performance on the Test data set does not improve for 2,000 consecutive iterations.

After the training we record the performance of the classifier via the log-loss and a weighted F-score (\(F_{50}\)) when applied to the Test dataset. Although cross-validation and early stopping allow some degree of data leakage from the Test dataset, this turns out to provide a conservative performance evaluation compared with the real performance in sample-wise hold-out tests.
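
The early-stopping mechanism can be sketched with the xgboost scikit-learn interface as follows; the random toy matrices stand in for the annotated Train and Test mutations and the snippet is illustrative only:

    import numpy as np
    from xgboost import XGBClassifier

    rng = np.random.RandomState(0)
    X_train, y_train = rng.rand(60, 14), rng.randint(0, 2, 60)
    X_test, y_test = rng.rand(26, 14), rng.randint(0, 2, 26)

    model = XGBClassifier(n_estimators=20000, learning_rate=0.001, max_depth=4,
                          subsample=0.7, objective="binary:logistic")
    model.fit(X_train, y_train,
              eval_set=[(X_test, y_test)],    # sequential evaluation on the Test set
              eval_metric="logloss",          # cross-entropy
              early_stopping_rounds=2000,     # stop after 2,000 non-improving iterations
              verbose=False)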

Shapley Additive Explanations

Each base classifier yields an additive explanation model based on Shapley Additive Explanations (SHAP values) [Lundberg and Lee, 2017]. Specifically, each expert classifier can decompose the logit forecast yielded for a particular mutation m, say \(\textrm{logit}\, p(m)\), into a collection of SHAP values \(\{s_i(m)\}\), one per feature, in such a way that \(s_1(m) + \ldots + s_n(m) = \textrm{logit}\, p(m)\) for all m.

The intuition behind the concept of SHAP value is rooted in, and better understood in the context of, cooperative game theory, where the Shapley value [Shapley, 1988] was originally described. If we think of \(\textrm{logit}(p)\) as a utility function that depends on the contribution of a coalition of features (thought of as players of a cooperative game), the SHAP values account for the relative contributions of the features by averaging the differences between: i) the expected forecast when the feature value is known and ii) the expected forecast when the feature value is not known.

More specifically, given an individual base classifier, denoted by B, the additive explanation model \(\mathcal{A}(B)\) for B is a map from the feature space \(F=F_1\times\ldots\times F_n\) (the set of arrays representing the possible values of the features) into the Euclidean space \(\mathbb{R}^n\) of dimension n equal to the number of features, which satisfies the following two requirements:

  • The additive explanation model \(\mathcal{A}(B)\) gives an additive decomposition of the logit forecast given by B, i.e., if \(m\in F\) denotes the array of feature values for some mutation instance and \(p(m)\) denotes the logit prediction of the classifier on m, and \(\mathcal{A}(B)(m) = (\phi_1, \ldots, \phi_n)\), then \(p(m) = \phi_1 + \ldots + \phi_n\).

  • Given a feature array m for some mutation instance, the i-th component of \(\mathcal{A}(B)\) is an estimate of the average marginal payoff of the i-th feature over all possible feature coalitions, i.e.,

    \(\phi_i(m) = \sum_{S\subseteq [n]\setminus \{i\}} K(S)\cdot (f(S\cup\{i\};m) - f(S;m))\)

    where S runs through all possible subsets of the set of features excluding the i-th feature; and

    \(K(S) = \frac{|S|!\,(n - |S| - 1)!}{n!}\)

    is an averaging coefficient that depends on the size of S; and

    \(f(U;m) = \textrm{E}\left[\,p(x) \mid x_U = m_U\,\right]\)

    is the conditional expectation of the model’s prediction for m, given that only the values of the features belonging to the subset U are known.
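
In practice, these per-classifier explanations can be obtained with the shap library; a minimal sketch, reusing the trained model and Test matrix from the training sketch above:

    import shap

    # Sketch: additive explanation of one trained base classifier. For a tree model
    # with a binary:logistic objective, TreeExplainer returns SHAP values in log-odds
    # units, so for each mutation they add up (together with the expected value) to
    # the logit of the forecast.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)  # one row per mutation, one column per feature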

Aggregator of base classifiers

One challenge of our approach is that during training a fraction of the Driver labels may correspond to passenger mutations, even if this fraction will in general be lower than 15% (recall that we only consider those mutations matching a consequence-type specific dNdScv excess higher than 85%). This implies that each individual expert classifier may be at risk of being biased by the specific choices of Drivers made in the Splits. To work around this difficulty, we propose to use a pool of base classifiers, each trained with a different partial view of the data, so that the final forecast and explanation result from combining these classifiers, while also correcting for the systematic biases that they may bear.

Forecast aggregator

Our approach resorts to a non-linear combination of probabilities based on a logit-normal model [Satopää et al., 2014]. Specifically, if a classifier \(B_i\) casts a prediction \(p_i\) that a specific mutation is a driver and \(Y_i=\textrm{logit}(p_i)\), this modelling choice assumes that \(p_i\) arises from some latent true probability p that the mutation is a driver, which has in turn been moderated by some degree of systematic bias:

\[Y_i = \frac{1}{a}\log\left(\frac{p}{1-p}\right) + \epsilon_i\]

with \(\epsilon_i \sim \textrm{N}(0,\sigma^2)\) a normal random variable with standard deviation \(\sigma\), and \(a\geq 0\) representing the systematic bias.

Intuitively, the systematic bias quantifies the extent to which individual classifiers regress towards a log-odds of zero due to partial information or under-confidence when issuing their respective forecasts. While a=1 would be associated with an accurate forecast, a>1 represents under-confidence.

Using the previous modelling assumptions, given a collection of classifiers and their respective forecasts \(p_i\) and logits \(Y_i\), the latent true probability p can be estimated as \(\hat{p}\) with the following estimator:

\[\hat{p}(a) = \frac{\exp(a\overline Y)}{1 + \exp(a\overline Y)}\]

where \(\overline Y\) denotes the average of the logits \(Y_i\) and \(a\geq 1\) is the systematic bias.

The level of systematic bias gives us a method to sharpen the consensus forecasts of a coalition of under-confident predictors. In the interest of clarity and interpretability of the outcome, we require boostDM forecasts to have a marked bimodal tendency, yielding values close to either 0 or 1. Upon testing the method in some exemplary gene-tumor type contexts, we committed to the choice a=2.3 for all models.
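
A minimal sketch of this consensus rule (the function name is ours):

    import numpy as np

    def aggregate_forecasts(probs, a=2.3):
        """Average the base classifiers' log-odds and sharpen the consensus with
        the systematic-bias parameter a (logit-normal aggregator)."""
        probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1 - 1e-12)
        mean_logit = np.mean(np.log(probs / (1 - probs)))
        return 1.0 / (1.0 + np.exp(-a * mean_logit))

    aggregate_forecasts([0.70, 0.80, 0.65])  # consensus pushed towards 1 (~0.90)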

SHAP aggregator

The SHAP vector associated to a specific input mutation is computed as the feature-wise mean of the SHAP vectors cast across base classifiers.

Model Selection

In principle, for each (G,T) context we can compute a classification model as an aggregator of base classifiers. A weak predictive power can come about either because there are not enough observed mutations to render a clear signal by means of regression with the current features, or because, regardless of the number of mutations, the set of mutational features is not sufficient to learn a distinct signal for the Drivers. A weak cross-validation predictive ability is a clear sign that the model is not up to the learning problem; however, a strong performance is not an unequivocal sign that the model is solving the task properly. Herein we discuss the criteria to deem a model sufficient and the rule to decide which (general) model to use given a target gene-tumor type pair.

Evaluation in test sets

Train-test splits give rise to base classifiers that are ultimately merged into a consensus classifier. These splits allow performance assessment by casting the predictions of the base classifiers trained with each split on the corresponding test set (cross-validation). The performance measure used to evaluate each base classifier is a weighted F-score. The selection of the weighted F-score is motivated by the fact that some (up to 15%, given one of the thresholds set for building models) “driver” mutations may actually be passengers (see below). From the 50 splits computed to generate the consensus classifier, a global performance summary is derived by taking the median of the F-scores.

F-score

To establish a simple and uniform performance criterion we resort to a weighted F-score that gives more importance to precision than to recall. We used an F-beta score with beta=0.5, as defined in [van Rijsbergen, 1979], which measures the effectiveness of retrieval from the perspective of a user who attaches twice as much importance to precision as to recall. We denote this as \(F_{50}\) throughout.

First, in assessing driverness, false negatives are in general more tolerable than false positives for a classifier with practical implications, as the real set of drivers is generally assumed to be much smaller than the set of all possible mutations that can be generated under neutral selection. Second, due to the intrinsically impure set of the mutations used for learning, some driver mutations in the test set are expected not to be true driver mutations. Consequently, a theoretically perfect classifier could not even attain perfect recall, hence recall must be undervalued in order to assess the method’s performance.
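
For illustration, this score is available off the shelf in scikit-learn (a minimal sketch with toy labels):

    from sklearn.metrics import fbeta_score

    # Toy Test labels (1 = Driver, 0 = Passenger) and predictions.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    f50 = fbeta_score(y_true, y_pred, beta=0.5)  # precision weighed twice as much as recall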

Representativeness

Intuitively, the performance criterion delineated in the previous section is not sufficient to establish the confidence of the assessment. For example, low confidence would follow if the repertoire of observed mutations upon which the model is trained is not representative of the set of potentially observable mutations in the gene-tumor type context. Therefore, the performance assessment must be endowed with a measure of representativeness of the observed mutation set.

To measure representativeness we make use of two properties: i) the propensity to observe the same mutations on repeated subsamples (measured by the discovery index, see online Methods) and ii) the number of mutations. We establish, as a rule, that to build a model representing a gene-tumor type with low discovery index, more observed mutations are required than those needed to build a model of a gene-tumor type with higher discovery index. Specifically, the following rule is followed: if the discovery index is higher than 0.2 at least 30 observed mutations are required to build the model; otherwise we require at least 40 observed mutations.

Model quality criteria

We deem the quality of the models sufficient if they meet the following conditions:

Minimum training size

\(\geq 30\) training examples on average across training splits.

Representativeness

The set of observed mutations qualifies as a representative set; specifically, we require more observed mutations if the discovery index is lower, i.e., if N denotes the total number of observed mutations in a given gene and tumor type context, at least one of the following conditions must hold: \(N \geq 40\); or discovery index \(\geq 0.2\) and \(N \geq 30\); or discovery index \(\geq 0.4\) and \(N \geq 20\); or discovery index \(\geq 0.4\) and \(N \geq 10\).

Cross-validation performance

\(F_{50} \geq 0.8\) across base classifiers.
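
A minimal sketch that combines these criteria as stated above (function and argument names are ours):

    def model_is_usable(avg_train_size, n_observed, discovery_index, f50):
        """Sketch of the model quality criteria described above."""
        enough_training = avg_train_size >= 30
        representative = (n_observed >= 40
                          or (discovery_index >= 0.2 and n_observed >= 30)
                          or (discovery_index >= 0.4 and n_observed >= 20)
                          or (discovery_index >= 0.4 and n_observed >= 10))
        good_performance = f50 >= 0.8
        return enough_training and representative and good_performance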

Model selection rule

We formulate the following model selection rule: for mutations mapping to gene G and cohort C, the model of choice will be the (G,T)-model where T is the most specific ancestor tumor type of C in the Oncotree such that the (G,T)-model meets the minimum quality criteria. If there is no such tumor type, we deem the forecast not feasible for that specific query.
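
A minimal sketch of this rule as a walk up the Oncotree; the toy ontology and the helper names are ours:

    # Toy fragment of a tumor type hierarchy (child -> parent), ending at CANCER.
    PARENT = {"LUAD": "NSCLC", "NSCLC": "LUNG", "LUNG": "SOLID",
              "SOLID": "CANCER", "CANCER": None}

    def select_model(gene, cohort_ttype, usable_models):
        """Return the most specific (G,T)-model meeting the quality criteria,
        starting from the cohort's tumor type and walking towards CANCER."""
        ttype = cohort_ttype
        while ttype is not None:
            if (gene, ttype) in usable_models:
                return (gene, ttype)
            ttype = PARENT[ttype]
        return None  # forecast deemed not feasible for this query

    select_model("EGFR", "LUAD", {("EGFR", "NSCLC"), ("EGFR", "CANCER")})
    # -> ("EGFR", "NSCLC")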