Google, Stanford use machine learning on 37.8m data points for drug discovery

Deep learning and multitask networks used on 259 datasets

Researchers from Google and Stanford University have used machine learning methods – deep learning and multitask networks – to discover effective drug treatments for a variety of diseases.

Deep learning, which deals with many hidden layers in artificial neural networks, enables scientists to synthesise large amounts of data into predictive models. Multitask networks compensate for limited data within an experiment, and allow data to be shared across different experiments.
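To make the idea of sharing data across experiments concrete, here is a minimal sketch of one common way to build a multitask network: a hidden layer shared by every task, followed by one output per experiment. It is an illustration only and not necessarily the architecture the researchers used. The layer sizes, the random weights and the names `multitask_forward` and `masked_log_loss` are all assumptions made for the example.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def multitask_forward(features, shared, heads):
    """Shared hidden layer (ReLU), then one sigmoid output per task."""
    hidden = [max(0.0, h) for h in dense(features, shared["w"], shared["b"])]
    return [sigmoid(sum(w * h for w, h in zip(head["w"], hidden)) + head["b"])
            for head in heads]


def masked_log_loss(predictions, labels):
    """Mean log loss over the tasks that have a label (None = not measured)."""
    terms = [-math.log(p if y == 1 else 1.0 - p)
             for p, y in zip(predictions, labels) if y is not None]
    return sum(terms) / len(terms)


# Hypothetical sizes: 8 input features, 4 shared hidden units, 3 tasks.
rng = random.Random(0)
n_features, n_hidden, n_tasks = 8, 4, 3
shared = {
    "w": [[rng.uniform(-0.5, 0.5) for _ in range(n_features)]
          for _ in range(n_hidden)],
    "b": [0.0] * n_hidden,
}
heads = [{"w": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)], "b": 0.0}
         for _ in range(n_tasks)]

# One compound, measured in tasks 0 and 2 but not in task 1.
features = [rng.choice([0.0, 1.0]) for _ in range(n_features)]
labels = [1, None, 0]

predictions = multitask_forward(features, shared, heads)
loss = masked_log_loss(predictions, labels)
```

Because the hidden layer is shared, a measurement from any one experiment would update weights that every other task also relies on during training. The masked loss lets a compound contribute even when it was tested in only a few of the experiments.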

“Discovering new treatments for human diseases is an immensely complicated challenge. Even after extensive research to develop a biological understanding of a disease, an effective therapeutic that can improve the quality of life must still be found,” the researchers wrote on the Google Research Blog.

“This process often takes years of research, requiring the creation and testing of millions of drug-like compounds in an effort to find just a few viable drug treatment candidates.”

The researchers added that high-throughput screening (rapid automated screening of diverse compounds) is expensive and is usually done in sophisticated labs, which means it may not be the most practical solution.

Applying machine learning to virtual screening, a computational counterpart to high-throughput screening, is another route to drug discovery. However, low hit rates, which produce imbalanced datasets, and a “paucity” of experimental data, which leads to overfitting (where a model fits noise in the training data rather than the underlying pattern), remain challenges, the researchers said.

“Virtual screening attempts to replace or augment the high-throughput screening process by the use of computational methods. Machine learning methods have frequently been applied to virtual screening by training supervised classifiers to predict interactions between targets and small molecules.

“The overall complexity of the virtual screening problem has limited the impact of machine learning in drug discovery,” the researchers wrote in their paper called Massively Multitask Networks for Drug Discovery.

The researchers worked with 259 publicly available datasets on biological processes, which contained 37.8 million data points for 1.6 million compounds.

The datasets were made up of 128 experiments from the PubChem BioAssay database (PCBA), 17 datasets designed to avoid common pitfalls in virtual screening (MUV), and 102 datasets for evaluating methods that predict interactions between proteins and small molecules (DUD-E).

There were also 12 datasets from the 2014 Tox21 data challenge, run by the National Center for Advancing Translational Sciences in the US. The goal of Tox21 is to crowdsource data analysis conducted by independent researchers to discover how they can predict compounds' interference in biochemical pathways using only chemical structure data.
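The reported figures can be checked with a few lines of arithmetic. The sketch below confirms that the four collections add up to 259 datasets and estimates how sparsely the compounds were measured. It assumes each of the 37.8 million data points is a single compound measured in a single dataset, which the article does not state explicitly; the variable names are made up for the example.

```python
# Dataset counts as reported in the article.
datasets = {"PCBA": 128, "MUV": 17, "DUD-E": 102, "Tox21": 12}
total_datasets = sum(datasets.values())

data_points = 37.8e6  # reported total measurements
compounds = 1.6e6     # reported number of compounds

# Assumption: one data point = one compound measured in one dataset.
avg_datasets_per_compound = data_points / compounds
fill_fraction = data_points / (compounds * total_datasets)

print(f"total datasets: {total_datasets}")
print(f"average datasets per compound: {avg_datasets_per_compound:.1f}")
print(f"share of compound/dataset pairs measured: {fill_fraction:.1%}")
```

Under that assumption, an average compound appears in roughly 24 of the 259 datasets, so about 9 per cent of all possible compound and dataset pairs carry a measurement.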

“Because of our large scale, we were able to carefully probe the sensitivity of these models to a variety of changes in model structure and input data,” the researchers wrote.

“We carefully quantified how the amount and diversity of screening data from a variety of diseases with very different biological processes can be used to improve the virtual drug screening predictions.

“Our models are able to utilise data from many different experiments to increase prediction accuracy across many diseases,” the researchers wrote.

The models were evaluated using the area under the receiver operating characteristic (ROC) curve, a measure of how well a classifier ranks active compounds above inactive ones.
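For readers unfamiliar with the metric, the area under the ROC curve can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The sketch below computes it that way by comparing every active/inactive pair. The function name `roc_auc` and the toy labels and scores are invented for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (active, inactive) pairs in which the active
    compound is scored higher; tied scores count as half a win.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy example: 2 actives (label 1) and 4 inactives (label 0).
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 0.875: 7 of the 8 active/inactive pairs are ranked correctly
```

A score of 1.0 means every active outranks every inactive, while 0.5 is what random scoring would achieve. The pairwise loop is easy to read but slow on large datasets, where a sort-based calculation is normally used instead.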

“The imbalance present in our datasets means that performance varies widely depending on the particular training/test split. To compensate for this variability, we used stratified K-fold cross-validation; that is, each fold maintains the active/inactive proportion present in the unsplit data,” added the researchers.
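The stratified splitting the researchers describe can be sketched in a few lines: group the compounds by label, shuffle each group, then deal each group out across the folds so that every fold keeps roughly the original active/inactive ratio. This is a simplified illustration rather than the researchers' code; the function name `stratified_kfold` and the toy dataset of 5 actives among 50 compounds are assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Split indices into k test folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy imbalanced dataset: 5 actives among 50 compounds (10% hit rate).
labels = [1] * 5 + [0] * 45
folds = stratified_kfold(labels, k=5)

# Each fold serves once as the test set; the rest is training data.
actives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(actives_per_fold)  # one active in each fold of 10 compounds
```

With a purely random split, some folds in this toy dataset could easily end up with no actives at all, which is the kind of variability the stratified approach is meant to avoid.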

A key finding was that multitask networks allow for significantly more accurate predictions than single-task methods. Their predictive capability also improves as more tasks and data are added to the models, and large multitask networks transferred better to tasks not contained in the training data.

The researchers noted in their paper that access to more relevant data is key to being able to build state-of-the-art models.

“Major pharmaceutical companies possess vast private stores of experimental measurements; our work provides a strong argument that increased data sharing could result in benefits for all.”

The researchers also wrote that it’s “disappointing… that all published applications of deep learning to virtual screening (that we are aware of) use distinct datasets that are not directly comparable”, meaning standards for datasets and performance metrics need to be established.

“Another direction for future work is the further study of small molecule featurization. In this work, we use only one possible featurization (ECFP4), but there exist many others,” the researchers said.

Follow Rebecca Merrett on Twitter: @Rebecca_Merrett