Motivation

To identify individual cancer driver mutations, a novel approach is required that annotates all possible mutations in a gene, independently of their probability of occurrence, as potential drivers or passengers. Instead of relying on functional impact metrics [Kircher et al., 2014, Pollard et al., 2010], this method should measure the ability of a mutation to drive tumorigenesis. Moreover, because both the function of each cancer gene and its role in tumorigenesis differ, we can expect the features that define driver mutations to differ from gene to gene. We therefore aim to create an approach that learns the features that define driver mutations for each cancer gene independently. In addition, the same cancer gene may drive tumorigenesis through different mechanisms in different tissues (e.g. compare EGFR in Lung adenocarcinoma and Glioblastoma). Thus, if enough data are available, we aim to create gene- and cancer type-specific models. Furthermore, such a classification should ideally yield human-readable results that help researchers pinpoint the key features defining driver mutations in a cancer gene.

This problem has been approached before through saturation mutagenesis experiments, in which all possible mutants of a cancer gene are generated and their impact on protein function [Kakudo et al., 2005, Kato et al., 2003, Kawaguchi et al., 2005] or cell viability [Findlay et al., 2018, Mighell et al., 2018] is assessed. These experiments face obvious technical and economic hurdles. Furthermore, owing to limitations imposed by the experimental setup, these approaches do not directly measure the tumorigenic potential of mutations, but rather a proxy, such as their functional impact. For instance, in certain tumor suppressor genes, saturation mutagenesis experiments have been conducted in haploid human cells to identify mutations that abrogate cell viability [Findlay et al., 2018]. Only scattered mutagenesis assays have actually assessed the tumorigenic potential of mutations affecting cancer genes, and these are restricted to a few cell types, which do not represent the wide spectrum of tissue-specific constraints. Generalizing them to cover hundreds of cancer genes across cell types representing different tissues would be a herculean task.

To address this problem, we have developed a platform to train machine learning models that identify all possible driver mutations in cancer genes across cancer types. This document presents details of the methodology, features and data used to generate these gene- and tumor type-specific models, as well as an extensive benchmark of their performance. We then describe in detail the validation and comparison experiments used to critically assess our approach, including hold-out testing with experimentally validated rare oncogenic variants, and comparisons with experimental saturation mutagenesis of five genes and with several bioinformatics tools. We also re-trained models on subsamples to assess the growth outlook of the pipeline as more sequenced tumors become available. Finally, we include a short section justifying the choice of the exponential function used in the definition of the discovery index.