Pipeline
Availability
A public version of the pipeline is available in this git repo.
Pre-requisites and running
Follow the instructions in the repo README.
Input
The full pipeline draws on several data sources, most of them, though not all, produced by the IntOGen pipeline. A full installation of IntOGen is required to run the complete preprocessing of mutations and features that ultimately yields the training data.
For the public version we provide a simplified pipeline that starts from a preprocessed input summarizing the training data, together with all the necessary feature annotations, driver genes and cohort data.
regression data
create_datasets/<cohort-code>.regression_data.tsv
Collection of positive and negative training mutations alongside their features for a given cohort.
For each mutation it encompasses the following information:
- chr:
chromosome
- pos:
position in genomic coordinates (hg38)
- ref:
reference allele
- gene:
gene
- alt:
alternate allele
- PhyloP:
PhyloP conservation score
- aachange:
amino acid change
- nmd:
whether a stop mutation maps to the last coding exon
- Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination:
whether the mutation maps to a residue subject to post-translational modification.
- CLUSTL_SCORE, CLUSTL_cat_1, CLUSTL_cat_2:
oncodriveCLUSTL scores
- motif:
whether the mutation maps to a significantly enriched Pfam domain according to smRegions
- csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous:
simplified protein coding consequence type
- HotMaps_cat_1, HotMaps_cat_2:
HotMAPs features (tumor type specific and pan-cancer)
- smRegions_cat_1, smRegions_cat_2:
smRegions features (tumor type specific and pan-cancer)
- role_Act, role_LoF:
oncogenic mode of action of the gene (oncogene or tumor suppressor)
- response:
whether the mutation is labeled as positive or negative mutation for supervised learning
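As a quick sanity check on a regression data file, one might count the positive and negative training mutations per gene. The sketch below is only illustrative: the function name is hypothetical, the toy rows are made up and carry just a subset of the columns, and it assumes that response is coded as 1 (positive) and 0 (negative), which should be verified against the actual file.

```python
import io

import pandas as pd


def label_counts_per_gene(regression_data: pd.DataFrame) -> pd.DataFrame:
    """Count positive and negative training mutations per gene.

    Assumption: `response` is coded 1 (positive) / 0 (negative).
    """
    counts = regression_data.groupby("gene")["response"].agg(
        positives="sum", total="count"
    )
    counts["negatives"] = counts["total"] - counts["positives"]
    return counts[["positives", "negatives"]]


# Made-up rows standing in for create_datasets/<cohort-code>.regression_data.tsv
toy_tsv = (
    "chr\tpos\tref\talt\tgene\tresponse\n"
    "1\t100\tC\tT\tGENE_A\t1\n"
    "1\t101\tG\tA\tGENE_A\t0\n"
    "2\t200\tC\tA\tGENE_B\t1\n"
)
regression_data = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
counts = label_counts_per_gene(regression_data)
print(counts)
```

With a real file, the io.StringIO object would be replaced by the path to the TSV.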
saturation annotation files
saturation/annotation/<tumor-type>/<gene>.annotated.out.gz
Comprehensive catalogue of all the mutations mapping to the canonical transcript of each gene in each tumor type context, annotated according to the results of the IntOGen pipeline, VEP and PhosphoSitePlus.
For each mutation it encompasses the following information:
- chr, pos
hg38 genomic coordinates
- alt
alternate allele
- ENSEMBL_GENE, ENSEMBL_TRANSCRIPT, Feature_type, cDNA_position, CDS_position, Protein_position, Amino_acids, Codons, Existing_variation, IMPACT, DISTANCE, STRAND, FLAGS, gene, SYMBOL_SOURCE, HGNC_ID, CANONICAL, ENSP, EXON, INTRON:
customary VEP annotations; a description can be found in the VEP feature description
- boostDM features:
PhyloP, aachange, nmd, Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination, CLUSTL_SCORE, motif, csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous, CLUSTL_cat_1, CLUSTL_cat_2, HotMaps_cat_1, HotMaps_cat_2, smRegions_cat_1, smRegions_cat_2
- role_Act, role_LoF:
one hot encoding of the tumorigenic mode of action
drivers
datasets/drivers.tsv
Lists the driver genes that have been found in each cohort by IntOGen (https://www.intogen.org).
Each item has the following fields:
SYMBOL, TRANSCRIPT, COHORT, CANCER_TYPE, METHODS, MUTATIONS, SAMPLES, %_SAMPLES_COHORT, QVALUE_COMBINATION, ROLE, CGC_GENE, CGC_CANCER_GENE, DOMAIN, 2D_CLUSTERS, 3D_CLUSTERS, EXCESS_MIS, EXCESS_NON, EXCESS_SPL
For details, please check the IntOGen FAQs and Extended Documentation.
cohorts
datasets/cohorts.tsv
Provides a fuller description of each cohort used.
Each item has the following fields:
COHORT, CANCER_TYPE, CANCER_TYPE_NAME, SOURCE, PLATFORM, PROJECT, REFERENCE, TYPE, TREATED, AGE, SAMPLES, MUTATIONS, WEB_SHORT_COHORT_NAME, WEB_LONG_COHORT_NAME
For details, please check the IntOGen FAQs and Extended Documentation.
oncotree
Tumor type ontology defining highly specific tumor type categories and a hierarchical structure to combine them. A detailed description is provided in the section Oncotree: tumor type ontology.
It consists of two files:
datasets/tree_cancer_types.json
ontology hierarchy as a JSON dictionary
datasets/definitions.json
definitions of all the tumor type acronyms used in the oncotree
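To illustrate how a hierarchy of this kind can be used to combine tumor types, the following sketch collects every tumor type nested under a given term. It assumes, without confirmation from the file itself, that the JSON dictionary maps each parent term to the list of its direct children; the function name and the acronyms in the toy tree are hypothetical.

```python
import json


def descendants(tree: dict, node: str) -> set:
    """Return every tumor type nested below `node`.

    Assumption: `tree` maps each parent term to a list of direct children.
    """
    found = set()
    stack = list(tree.get(node, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.add(child)
            stack.extend(tree.get(child, []))
    return found


# Toy hierarchy standing in for datasets/tree_cancer_types.json
toy_tree = json.loads(
    '{"ROOT": ["SOLID", "NON_SOLID"], "SOLID": ["ORGAN_X"], "ORGAN_X": ["SUB_1", "SUB_2"]}'
)
solid_terms = descendants(toy_tree, "SOLID")
print(sorted(solid_terms))
```

If the real file uses a different layout (for instance nested dictionaries), the traversal would need to be adapted accordingly.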
discovery
discovery/discovery.tsv
Pre-computed summary information for each gene and tumor type where the following features are provided:
- gene, ttype:
gene and tumor type
- n_muts:
number of observed mutations in gene in samples matching the tumor type; more general terms in the oncotree ontology will comprise more samples
- n_unique_muts:
number of unique mutations among those counted in n_muts
- n_samples:
total number of samples matching tumor type with at least one mutation in the gene
- discovery_index:
discovery index corresponding to each gene and tumor type
- discovery_high:
discovery index upper (IQR) confidence bound
- discovery_low:
discovery index lower (IQR) confidence bound
Output
Data splits for training
splitcv/<cohort-id>.cvdata.pickle.gz
Dictionaries indexed by the set of driver genes in the cohort.
- For each gene there is a list of tuples (as many elements as base models) with 4 elements each:
x_train (pandas.DataFrame, feature data per training instance)
x_test (pandas.DataFrame, feature data per testing instance)
y_train (pandas.Series, response labels per training instance, binary)
y_test (pandas.Series, response labels per testing instance, binary)
The feature information has the following columns:
chr, pos, ref, alt, CLUSTL_SCORE, CLUSTL_cat_1, CLUSTL_cat_2, HotMaps_cat_1, HotMaps_cat_2, smRegions_cat_1, smRegions_cat_2, PhyloP, nmd, Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination, csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous, role_Act, role_LoF
splitcv_meta/<tumor-type>/<gene>.cvdata.pickle.gz
Pickled list of tuples (as many elements as base models) of 4 elements each following the same structure as described above. These are the result of aggregating the splitting information tumor-type-wise.
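The split files can be inspected with the standard gzip and pickle modules. The sketch below writes a toy object with the structure described above (gene mapped to a list of 4-tuples) to a temporary file and reads it back; the helper names and the toy values are hypothetical, and the tuple order is taken from the listing above.

```python
import gzip
import os
import pickle
import tempfile

import pandas as pd


def load_gzip_pickle(path):
    """Load a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def split_sizes(cvdata):
    """Map each gene to the (train size, test size) of each of its splits."""
    return {
        gene: [(len(x_train), len(x_test)) for x_train, x_test, _, _ in splits]
        for gene, splits in cvdata.items()
    }


# Toy data mimicking splitcv/<cohort-id>.cvdata.pickle.gz with one base model
features = pd.DataFrame({"PhyloP": [0.1, 2.3, 5.0], "nmd": [0, 0, 1]})
labels = pd.Series([0, 1, 1])
toy_cvdata = {
    "GENE_A": [(features.iloc[:2], features.iloc[2:], labels.iloc[:2], labels.iloc[2:])]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "TOY.cvdata.pickle.gz")
    with gzip.open(path, "wb") as handle:
        pickle.dump(toy_cvdata, handle)
    sizes = split_sizes(load_gzip_pickle(path))

print(sizes)
```

For the files under splitcv_meta, the loaded object would be the list of tuples itself rather than a dictionary keyed by gene.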
Gradient boosting base classifiers
training_meta/<tumor-type>/<gene>.models.pickle.gz
Pickled dictionaries where the set of trained base models for a given gene and tumor type are kept. They comprise the following levels of information:
- models:
list of trained base classifiers (instances of boostwrap.methods.Classifier)
- x_test:
list of pandas.DataFrame instances with feature information used for testing at each base classifier
- y_test:
list of pandas.Series instances with response labels used for testing at each base classifier
- learning_curves:
list of dictionaries with performance evaluation vectors at train and test (validation_0 and validation_1), used for graphical evaluation of the learning progress as a function of the number of estimators for each base classifier.
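As an example of how the learning curves might be used, the sketch below finds the number of estimators at which the test curve of one base classifier reaches its minimum. It assumes each learning-curve dictionary nests a metric name under validation_0 and validation_1, as in {"validation_1": {"logloss": [...]}}; both this layout and the metric key "logloss" are assumptions to check against the pickled data, and the function name is hypothetical.

```python
def best_n_estimators(learning_curve, metric="logloss"):
    """Return the number of estimators minimizing the test (validation_1) curve.

    Assumption: learning_curve[split][metric] is a list with one value
    per boosting round, and lower values are better.
    """
    test_curve = learning_curve["validation_1"][metric]
    best_index = min(range(len(test_curve)), key=test_curve.__getitem__)
    return best_index + 1  # rounds are counted from 1


# Toy curve: training loss keeps falling, test loss turns upward after round 3
toy_curve = {
    "validation_0": {"logloss": [0.69, 0.50, 0.40, 0.30]},
    "validation_1": {"logloss": [0.69, 0.55, 0.52, 0.58]},
}
best_round = best_n_estimators(toy_curve)
print(best_round)
```

A growing gap between the two curves after this point is the usual sign of overfitting that such plots are meant to reveal.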
Evaluation of base classifiers
evaluation/<tumor-type>/<gene>.eval.pickle.gz
Performance information of the base models, kept separately. Each data instance is a dictionary of lists holding one value per base classifier, in matching order. We computed the following performance metrics:
- auc:
area under the ROC curve
- mcc:
Matthews correlation coefficient
- logloss:
Log-loss or cross-entropy
- precision:
Precision, a.k.a. positive predictive value (PPV), i.e. the proportion of predicted positives that are true positives.
- npv:
Negative predictive value, i.e. the proportion of predicted negatives that are true negatives.
- recall:
Recall, a.k.a. sensitivity, i.e. the proportion of actual positives that are predicted as positive.
- fscore:
F-score, i.e. harmonic mean of precision and recall.
- \(F_{50}\) (fscore50):
\(F_{\beta}\) with \(\beta=0.5\); compared with the F-score, this score weights precision more heavily than recall.
- accuracy:
Accuracy, i.e. rate of correctly classified mutations over all predictions.
- balance:
Test dataset balance, i.e. deviation of the proportion of labels from 0.5; perfect balance should yield 0.
- calibration:
Calibration, i.e. extent to which the average boostDM predicted score matches the proportion of positive labels, which in the case of a balanced set would be 0.5. This is computed as \((\overline{\hat{y}_i} - \overline{y}_i) / \overline{y}_i\), where \(\hat{y}_i\) are the boostDM predicted values, \(y_i\) are the true labels, and bars denote averaging.
- size:
Test size, i.e. total number of mutations in the test set.
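The threshold-based metrics above, together with balance and calibration, can be sketched in plain Python as follows. This is not the pipeline's own code: the function names are hypothetical, the 0.5 threshold is taken from the description of boostDM_class, and the sign convention chosen for balance (proportion of positives minus 0.5) is an assumption. The auc and logloss metrics are left out for brevity.

```python
import math


def ratio(numerator, denominator):
    """Safe division: NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluation_metrics(y_true, y_score, threshold=0.5):
    """Compute a subset of the listed metrics from labels and predicted scores."""
    y_pred = [int(score > threshold) for score in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)

    def f_beta(beta):
        b2 = beta ** 2
        return ratio((1 + b2) * precision * recall, b2 * precision + recall)

    size = len(y_true)
    mean_label = sum(y_true) / size
    mean_score = sum(y_score) / size
    return {
        "mcc": ratio(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
        "precision": precision,
        "npv": ratio(tn, tn + fn),
        "recall": recall,
        "fscore": f_beta(1.0),
        "fscore50": f_beta(0.5),
        "accuracy": ratio(tp + tn, size),
        "balance": mean_label - 0.5,  # assumed sign convention
        "calibration": ratio(mean_score - mean_label, mean_label),
        "size": size,
    }


# Toy test set: three positives, three negatives, one false positive
metrics = evaluation_metrics([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.2, 0.1])
print(metrics)
```

In this toy case precision (0.75) is lower than recall (1.0), so fscore50 comes out lower than fscore, reflecting the heavier weight it places on precision.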
Saturation predictions
saturation/prediction/<gene>.<tumor-type>.prediction.tsv.gz
Predictions for all possible mutations in the canonical transcript for a specific gene in a tumor type context. The file includes the following columns:
gene, ENSEMBL_TRANSCRIPT, ENSEMBL_GENE, chr, pos, alt, aachange, CLUSTL_SCORE, CLUSTL_cat_1, CLUSTL_cat_2, HotMaps_cat_1, HotMaps_cat_2, smRegions_cat_1, smRegions_cat_2, PhyloP, nmd, Acetylation, Methylation, Phosphorylation, Regulatory_Site, Ubiquitination, csqn_type_missense, csqn_type_nonsense, csqn_type_splicing, csqn_type_synonymous, role_Act, role_LoF, selected_model_ttype, selected_model_gene, boostDM_score, boostDM_class, shap_CLUSTL_SCORE, shap_CLUSTL_cat_1, shap_CLUSTL_cat_2, shap_HotMaps_cat_1, shap_HotMaps_cat_2, shap_smRegions_cat_1, shap_smRegions_cat_2, shap_PhyloP, shap_nmd, shap_Acetylation, shap_Methylation, shap_Phosphorylation, shap_Regulatory_Site, shap_Ubiquitination, shap_csqn_type_missense, shap_csqn_type_nonsense, shap_csqn_type_splicing, shap_csqn_type_synonymous, shap_role_Act, shap_role_LoF
- Columns with shap_ prefix:
They denote the SHAP values corresponding to the prefixed features. The meaning of these values is explained in the section Shapley Additive Explanations.
- selected_model_gene, selected_model_ttype:
The gene and tumor type context of the model that was used to cast the predictions.
- boostDM_score, boostDM_class:
The prediction score and whether this score is higher than 0.5, the threshold that the method establishes for driver potential.
- Deprecation note:
At the moment the columns selected_model_gene, role_Act, role_LoF, shap_role_Act, shap_role_LoF are not informative. They were used in preliminary versions of the pipeline to handle analyses where pools of mutations from distinct genes were used for training meta-gene models. The current approach precludes this analysis and these columns will not be supported in forthcoming versions of the tool.
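A typical use of a prediction file is to keep the mutations scored above the 0.5 threshold and look at which feature contributes most to each prediction. The sketch below does this on a made-up table with only a few of the columns; the function name is hypothetical, and taking the feature with the largest SHAP value as the main contributor is a simplification made for illustration.

```python
import pandas as pd

SCORE_THRESHOLD = 0.5  # threshold for driver potential stated above


def predicted_drivers(predictions: pd.DataFrame) -> pd.DataFrame:
    """Keep mutations scored above the threshold and name their top SHAP feature."""
    shap_columns = [c for c in predictions.columns if c.startswith("shap_")]
    drivers = predictions[predictions["boostDM_score"] > SCORE_THRESHOLD].copy()
    # Feature with the largest SHAP value per row, with the prefix stripped
    drivers["top_feature"] = (
        drivers[shap_columns].idxmax(axis=1).map(lambda name: name[len("shap_"):])
    )
    return drivers[["aachange", "boostDM_score", "top_feature"]]


# Made-up rows standing in for a <gene>.<tumor-type>.prediction.tsv.gz file
toy_predictions = pd.DataFrame(
    {
        "aachange": ["A1V", "R2*", "L3L"],
        "boostDM_score": [0.91, 0.75, 0.03],
        "shap_PhyloP": [1.2, 0.1, -0.5],
        "shap_nmd": [0.0, 2.0, 0.0],
        "shap_csqn_type_synonymous": [0.0, 0.0, -3.0],
    }
)
result = predicted_drivers(toy_predictions)
print(result)
```

A real file can be read with pd.read_csv(path, sep="\t"), since pandas infers gzip compression from the .gz extension by default.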