Pipeline

Availability

A public version of the pipeline is available in this git repo.

Pre-requisites and running

Follow the instructions of the repo README.

Input

The full pipeline draws on several data sources, most of them produced by the IntOGen pipeline. A full installation of IntOGen is required to run the complete preprocessing of mutations and features that ultimately yields the training data.

For the public version we provide a simplified pipeline that starts with a preprocessed input that summarizes the training data alongside all the necessary feature annotations, driver genes and cohort data.

regression data

create_datasets/<cohort-code>.regression_data.tsv

Collection of positive and negative training mutations alongside their features for a given cohort.

For each mutation it encompasses the following information:

chr:

chromosome

pos:

position in genomic coordinates (hg38)

ref:

reference allele

gene:

gene

alt:

alternate allele

PhyloP:

PhyloP conservation score

aachange:

amino acid change

nmd:

whether a stop mutation maps to the last coding exon

Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination:

whether the mutation maps to a residue subject to post-translational modification.

CLUSTL_SCORE, CLUSTL_cat_1, CLUSTL_cat_2:

oncodriveCLUSTL scores

motif:

whether the mutation maps to a significantly enriched Pfam domain according to smRegions

csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous:

simplified protein coding consequence type

HotMaps_cat_1, HotMaps_cat_2:

HotMAPs features (tumor type specific and pan-cancer)

smRegions_cat_1, smRegions_cat_2:

smRegions features (tumor type specific and pan-cancer)

role_Act, role_LoF:

oncogenic mode of action of the gene (oncogene or tumor suppressor)

response:

whether the mutation is labeled as positive or negative mutation for supervised learning
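As a minimal sketch of how such a table might be inspected, the snippet below counts positive and negative training mutations per gene with pandas. It assumes a tab-separated file with a header row and a response column coded as 1 (positive) and 0 (negative); the function name and the synthetic rows are illustrative and not part of the pipeline.

```python
import io

import pandas as pd


def summarize_regression_data(path_or_buffer):
    """Count positive and negative training mutations per gene.

    Assumes a tab-separated table with a header row and at least the
    'gene' and 'response' columns, with response coded as 1/0.
    """
    df = pd.read_csv(path_or_buffer, sep="\t")
    grouped = df.groupby("gene")["response"]
    positives = grouped.sum()
    negatives = grouped.count() - positives
    return pd.DataFrame({"positives": positives, "negatives": negatives})


# Synthetic rows standing in for create_datasets/<cohort-code>.regression_data.tsv
example = io.StringIO(
    "chr\tpos\tref\talt\tgene\tresponse\n"
    "17\t7673802\tC\tT\tTP53\t1\n"
    "17\t7673803\tG\tA\tTP53\t0\n"
    "12\t25245350\tC\tA\tKRAS\t1\n"
)
summary = summarize_regression_data(example)
print(summary)
```

The same function accepts a file path in place of the in-memory buffer.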

saturation annotation files

saturation/annotation/<tumor-type>/<gene>.annotated.out.gz

Comprehensive catalogue of all the mutations mapping to the canonical transcript of each gene in each tumor type context, annotated with the results of the IntOGen pipeline, VEP and PhosphoSitePlus.

For each mutation it encompasses the following information:

chr, pos

hg38 genomic coordinates

alt

alternate allele

ENSEMBL_GENE, ENSEMBL_TRANSCRIPT, Feature_type, cDNA_position, CDS_position, Protein_position, Amino_acids, Codons, Existing_variation, IMPACT, DISTANCE, STRAND, FLAGS, gene, SYMBOL_SOURCE, HGNC_ID, CANONICAL, ENSP, EXON, INTRON

customary VEP annotations; see the VEP feature description for details

boostDM features:

PhyloP, aachange, nmd, Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination, CLUSTL_SCORE, motif, csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous, CLUSTL_cat_1, CLUSTL_cat_2, HotMaps_cat_1, HotMaps_cat_2, smRegions_cat_1, smRegions_cat_2

role_Act, role_LoF:

one-hot encoding of the tumorigenic mode of action
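To illustrate what a one-hot encoding of the mode of action looks like, here is a small sketch using pandas. The input table with a single categorical role column is hypothetical; only the resulting column names role_Act and role_LoF come from the file description above.

```python
import pandas as pd

# Hypothetical input: one categorical 'role' value per gene.
roles = pd.DataFrame({"gene": ["KRAS", "TP53"], "role": ["Act", "LoF"]})

# Expand the categorical column into one binary column per category.
encoded = pd.get_dummies(roles, columns=["role"], prefix="role", dtype=int)
print(encoded)
```

Each row then carries a 1 in the column matching its role and a 0 in the other.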

drivers

datasets/drivers.tsv

Lists the driver genes that have been found in each cohort by IntOGen (https://www.intogen.org).

Each item has the following fields:

SYMBOL, TRANSCRIPT, COHORT, CANCER_TYPE, METHODS, MUTATIONS, SAMPLES, %_SAMPLES_COHORT, QVALUE_COMBINATION, ROLE, CGC_GENE, CGC_CANCER_GENE, DOMAIN, 2D_CLUSTERS, 3D_CLUSTERS, EXCESS_MIS, EXCESS_NON, EXCESS_SPL

For a description of these fields, see the IntOGen FAQs and Extended Documentation.

cohorts

datasets/cohorts.tsv provides additional descriptive information about each cohort used.

Each item has the following fields:

COHORT, CANCER_TYPE, CANCER_TYPE_NAME, SOURCE, PLATFORM, PROJECT, REFERENCE, TYPE, TREATED, AGE, SAMPLES, MUTATIONS, WEB_SHORT_COHORT_NAME, WEB_LONG_COHORT_NAME

For a description of these fields, see the IntOGen FAQs and Extended Documentation.

oncotree

Tumor type ontology defining highly specific tumor type categories and a hierarchical structure to combine them. A detailed description is provided in the section Oncotree: tumor type ontology.

It consists of two files:

datasets/tree_cancer_types.json:

ontology hierarchy as a JSON dictionary

datasets/definitions.json:

definitions of all the tumor type acronyms used in the oncotree
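The sketch below shows one way to walk such a hierarchy and collect every tumor type nested under a given term. It assumes the JSON dictionary maps each parent acronym to a list of its child acronyms, with leaves having no entry of their own; this layout, the function name and the toy acronyms are assumptions made for illustration.

```python
import json


def descendants(tree, node):
    """Return all tumor types below `node`, sorted alphabetically.

    Assumes `tree` maps each parent acronym to a list of child
    acronyms and that leaves do not appear as keys.
    """
    found = []
    stack = list(tree.get(node, []))
    while stack:
        child = stack.pop()
        found.append(child)
        stack.extend(tree.get(child, []))
    return sorted(found)


# Hypothetical miniature hierarchy; the real one would be read from
# datasets/tree_cancer_types.json with json.load.
toy_tree = json.loads(
    '{"CANCER": ["SOLID", "NON_SOLID"],'
    ' "SOLID": ["LUNG", "BRCA"],'
    ' "LUNG": ["LUAD", "LUSC"]}'
)
solid_types = descendants(toy_tree, "SOLID")
print(solid_types)
```

A lookup of this kind is what allows a general term to pool the samples of all the specific tumor types beneath it.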

discovery

discovery/discovery.tsv

Pre-computed summary information for each gene and tumor type where the following features are provided:

gene, ttype:

gene and tumor type

n_muts:

number of observed mutations in gene in samples matching the tumor type; more general terms in the oncotree ontology will comprise more samples

n_unique_muts:

number of distinct mutations among those counted in n_muts

n_samples:

total number of samples matching tumor type with at least one mutation in the gene

discovery_index:

discovery index corresponding to each gene and tumor type

discovery_high:

discovery index upper (IQR) confidence bound

discovery_low:

discovery index lower (IQR) confidence bound
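As a rough sketch of how this summary might be explored, the snippet below loads the table and ranks gene and tumor type pairs by the width of their discovery index interval. It assumes a tab-separated file with a header row; the helper name, the interval_width column and the synthetic rows are illustrative only.

```python
import io

import pandas as pd


def add_interval_width(discovery):
    """Add the width of the (discovery_low, discovery_high) interval
    and sort rows from the narrowest to the widest interval."""
    out = discovery.copy()
    out["interval_width"] = out["discovery_high"] - out["discovery_low"]
    return out.sort_values("interval_width").reset_index(drop=True)


# Synthetic rows standing in for discovery/discovery.tsv
toy = io.StringIO(
    "gene\tttype\tn_muts\tn_unique_muts\tn_samples\t"
    "discovery_index\tdiscovery_high\tdiscovery_low\n"
    "GENE_A\tTT1\t120\t40\t110\t0.80\t0.85\t0.75\n"
    "GENE_B\tTT1\t15\t12\t15\t0.30\t0.55\t0.10\n"
)
discovery = add_interval_width(pd.read_csv(toy, sep="\t"))
print(discovery[["gene", "ttype", "discovery_index", "interval_width"]])
```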

Output

Data splits for training

splitcv/<cohort-id>.cvdata.pickle.gz

Pickled dictionary keyed by the driver genes of the cohort.

For each gene there is a list of tuples (as many elements as base models) with 4 elements each:
  • x_train (pandas.DataFrame, feature data per training instance)

  • x_test (pandas.DataFrame, feature data per testing instance)

  • y_train (pandas.Series, response labels per training instance, binary)

  • y_test (pandas.Series, response labels per testing instance, binary)

The feature information has the following columns:

chr, pos, ref, alt, CLUSTL_SCORE, CLUSTL_cat_1, CLUSTL_cat_2, HotMaps_cat_1, HotMaps_cat_2, smRegions_cat_1, smRegions_cat_2, PhyloP, nmd, Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination, csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous, role_Act, role_LoF
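A minimal sketch of reading one of these files follows. It assumes the file is a gzip-compressed pickle holding a dictionary from gene to a list of 4-tuples in the order listed above (x_train, x_test, y_train, y_test); the helper names and the toy data are hypothetical. As with any pickle, only files from a trusted source should be loaded.

```python
import gzip
import os
import pickle
import tempfile

import pandas as pd


def load_cvdata(path):
    """Load a gzip-compressed pickle such as <cohort-id>.cvdata.pickle.gz."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def split_sizes(cvdata):
    """Map each gene to a list of (n_train, n_test) pairs, one per split."""
    return {
        gene: [(len(x_train), len(x_test)) for x_train, x_test, _, _ in splits]
        for gene, splits in cvdata.items()
    }


# Toy data written to a temporary file to mimic the on-disk layout.
x = pd.DataFrame({"PhyloP": [1.2, -0.3, 4.1], "nmd": [0, 0, 1]})
y = pd.Series([1, 0, 1])
toy = {"TP53": [(x.iloc[:2], x.iloc[2:], y.iloc[:2], y.iloc[2:])]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "TOY.cvdata.pickle.gz")
    with gzip.open(path, "wb") as handle:
        pickle.dump(toy, handle)
    sizes = split_sizes(load_cvdata(path))

print(sizes)
```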

splitcv_meta/<tumor-type>/<gene>.cvdata.pickle.gz

Pickled list of tuples (as many elements as base models) of 4 elements each following the same structure as described above. These are the result of aggregating the splitting information tumor-type-wise.

Gradient boosting base classifiers

training_meta/<tumor-type>/<gene>.models.pickle.gz

Pickled dictionaries where the set of trained base models for a given gene and tumor type are kept. They comprise the following levels of information:

models:

list of trained base classifiers (instances of boostwrap.methods.Classifier)

x_test:

list of pandas.DataFrame instances with feature information used for testing at each base classifier

y_test:

list of pandas.Series instances with response labels used for testing at each base classifier

learning_curves:

list of dictionaries with performance evaluation vectors on the train and test sets (validation_0 and validation_1), which allow a graphical evaluation of the learning progress as a function of the number of estimators for each base classifier.

Evaluation of base classifiers

evaluation/<tumor-type>/<gene>.eval.pickle.gz

Performance information of the base models, kept in a separate file. Each data instance is a dictionary of index-aligned lists, with one value per base classifier. The following performance metrics are computed:

auc:

area under the ROC curve

mcc:

Matthews correlation coefficient

logloss:

Log-loss or cross-entropy

precision:

Precision, a.k.a. positive predictive value (PPV), i.e. the proportion of predicted positives that are true positives.

npv:

Negative predictive value, i.e. the proportion of predicted negatives that are true negatives.

recall:

Recall, a.k.a. sensitivity, i.e. the proportion of actual positives that are predicted as positive.

fscore:

F-score, i.e. harmonic mean of precision and recall.

\(F_{50}\) (fscore50):

\(F_{\beta}\) with \(\beta=0.5\); this score weights precision more heavily than recall, compared with the F-score.

accuracy:

Accuracy, i.e. rate of correctly classified mutations over all predictions.

balance:

Test dataset balance, i.e. deviation of the proportion of labels from 0.5; perfect balance should yield 0.

calibration:

Calibration, i.e. extent to which the average boostDM predicted score matches the proportion of positive labels, which for a balanced set would be 0.5. This is computed as \((\overline{\hat{y}_i} - \overline{y}_i) / \overline{y}_i\), where \(\hat{y}_i\) are the boostDM predicted values, \(y_i\) are the true labels, and bars denote averaging.

size:

Test size, i.e. total number of mutations in the test set.
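To make the definitions above concrete, the following sketch computes most of these metrics from true labels and predicted scores in plain Python. It is not the pipeline's own code: the function name is hypothetical, predictions are binarized at a score above 0.5, balance is taken here as the signed deviation of the proportion of positive labels from 0.5, and \(F_{\beta}\) uses the standard formula \((1+\beta^2)\,P R / (\beta^2 P + R)\) with precision \(P\) and recall \(R\). AUC and log-loss are left out for brevity.

```python
import math


def classification_metrics(y_true, y_score, threshold=0.5, beta=0.5):
    """Compute threshold-based metrics for binary labels and scores."""
    y_pred = [1 if s > threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(y_true)

    def ratio(num, den):
        # Undefined ratios are reported as NaN instead of raising.
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    npv = ratio(tn, tn + fn)
    recall = ratio(tp, tp + fn)
    mean_true = sum(y_true) / n
    mean_score = sum(y_score) / n
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "precision": precision,
        "npv": npv,
        "recall": recall,
        "fscore": ratio(2 * precision * recall, precision + recall),
        "fscore50": ratio(
            (1 + beta**2) * precision * recall, beta**2 * precision + recall
        ),
        "accuracy": (tp + tn) / n,
        "balance": mean_true - 0.5,
        "calibration": ratio(mean_score - mean_true, mean_true),
        "size": n,
    }


# Toy test set: four positives and four negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3]
metrics = classification_metrics(y_true, y_score)
print(metrics)
```

In a file with several base classifiers, a function like this would be applied once per classifier, yielding one value per position in each list.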

Saturation predictions

saturation/prediction/<gene>.<tumor-type>.prediction.tsv.gz

Predictions for all possible mutations in the canonical transcript for a specific gene in a tumor type context. The file includes the following columns:

gene, ENSEMBL_TRANSCRIPT, ENSEMBL_GENE, chr, pos, alt, aachange, CLUSTL_SCORE, CLUSTL_cat_1, CLUSTL_cat_2, HotMaps_cat_1, HotMaps_cat_2, smRegions_cat_1, smRegions_cat_2, PhyloP, nmd, Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination, csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous, role_Act, role_LoF, selected_model_ttype, selected_model_gene, boostDM_score, boostDM_class, shap_CLUSTL_SCORE, shap_CLUSTL_cat_1, shap_CLUSTL_cat_2, shap_HotMaps_cat_1, shap_HotMaps_cat_2, shap_smRegions_cat_1, shap_smRegions_cat_2, shap_PhyloP, shap_nmd, shap_Acetylation, shap_Methylation, shap_Phosphorylation, shap_Regulatory_Site, shap_Ubiquitination, shap_csqn_type_missense, shap_csqn_type_nonsense, shap_csqn_type_splicing, shap_csqn_type_synonymous, shap_role_Act, shap_role_LoF

Columns with shap_ prefix:

They denote the SHAP values of the features named after the prefix. The meaning of these values is explained in the section Shapley Additive Explanations.

selected_model_gene, selected_model_ttype:

Represent the gene and tumor type context of the model that was used to cast the predictions.

boostDM_score, boostDM_class:

Denote the prediction score and whether this score is higher than 0.5, the threshold the method establishes for driver potential.
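The sketch below shows how these columns could be used together: it flags mutations scoring above 0.5 and, for one mutation, ranks features by the absolute value of their SHAP columns. The helper name, the is_driver column and the toy values are illustrative assumptions; only the column naming scheme and the 0.5 threshold come from the description above.

```python
import pandas as pd


def top_shap_features(row, k=3):
    """Return the k feature names with the largest absolute SHAP value
    for a single mutation (one row of the prediction table)."""
    shap_cols = [c for c in row.index if c.startswith("shap_")]
    shap = row[shap_cols].astype(float)
    ranked = shap.abs().sort_values(ascending=False).head(k)
    return [name[len("shap_"):] for name in ranked.index]


# Toy rows with a subset of the columns of a prediction file.
predictions = pd.DataFrame(
    {
        "pos": [7673802, 7673803],
        "boostDM_score": [0.93, 0.12],
        "shap_PhyloP": [0.8, -0.1],
        "shap_CLUSTL_SCORE": [1.5, -0.4],
        "shap_nmd": [0.0, 0.05],
    }
)

# Mutations above the 0.5 threshold are the predicted drivers.
predictions["is_driver"] = predictions["boostDM_score"] > 0.5
drivers = predictions[predictions["is_driver"]]
top = top_shap_features(predictions.iloc[0], k=2)
print(drivers["pos"].tolist(), top)
```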

Deprecation note:

At the moment the columns selected_model_gene, role_Act, role_LoF, shap_role_Act and shap_role_LoF are not informative. They were used in preliminary versions of the pipeline to handle analyses where pools of mutations from distinct genes were used for training meta-gene models. The current approach precludes this analysis, and these columns will not be supported in forthcoming versions of the tool.