Data scientists compete to create cancer-detection algorithms

Nearly 10,000 data scientists recently competed in the Data Science Bowl to develop machine learning algorithms that can more accurately detect cancerous lesions in CT scans.


Data scientists are using machine learning to tackle lung cancer detection. Beginning in January, nearly 10,000 data scientists around the world competed in the Data Science Bowl to develop the most effective algorithm to help medical professionals detect lung cancer earlier and with better accuracy.

In 2010, the National Lung Screening Trial showed that annual screening with low-dose computed tomography (CT), a scanner that uses computer-processed combinations of many X-ray images from different angles to generate high-contrast 3D images, could reduce lung cancer deaths by 20 percent. While a breakthrough for early detection, the technology has also resulted in a relatively high rate of false positives compared with more traditional X-rays.

**An anonymized high-res lung scan from the NCI, which Data Science Bowl participants used when developing algorithms.**

"It's a really powerful approach that's reduced cancer deaths by 20 percent, but there's a very high rate of false positives," says Anthony Goldbloom, CEO of machine learning company Kaggle, which, with partner Booz Allen Hamilton, presents the annual Data Science Bowl. "A huge amount of people have been told they have cancer only find out later that they don't. There's a human cost to that. It's incredibly stressful."

So for this year's Data Science Bowl, Booz Allen and Kaggle decided to direct the power of data science and machine learning at the false-positive problem. The partners secured a $1 million prize purse, funded by the Laura and John Arnold Foundation, to be split among the top 10 contestants.

Data science for social good

Booz Allen and Kaggle created the Data Science Bowl in 2015 in an effort to focus data scientists on social good, says Josh Sullivan, senior vice president and chief data scientist for Booz Allen.

"We wanted to create something that galvanized people to come together to do something for social good, something bigger than themselves," he says. "How can we do something for social good that's pretty substantial? We wanted it to be something that would result in scientific discovery. Something open to the public; not for our benefit or our clients' benefit, but open source and crowd sourced to people around the world."

Sullivan says more than 300 ideas were submitted for the focus of the third annual Data Science Bowl (previous Data Science Bowls have focused on algorithms for determining ocean health and detecting heart disease). Ultimately, he says, the partners decided they would help the National Cancer Institute (NCI) with its Beau Biden Cancer Moonshot, an effort to accelerate cancer research to make more therapies available to more patients, and to improve cancer prevention and early detection.

NCI supplied the Data Science Bowl with 2,000 anonymized, high-resolution CT scans, each containing gigabytes of data. Sullivan says 1,500 of the images formed the training set, each accompanied by its final diagnosis. The remaining 500 images were the problem set. Using the training set, competitors' machine learning algorithms had to learn to correctly determine whether lesions in the lungs in those 500 images were cancerous. The algorithms were scored based on the percentage of correct diagnoses.
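The split and scoring described here can be sketched in a few lines of Python. This is a minimal illustration, not the competition's actual code: the scan IDs, the `split_scans` and `accuracy` helpers, and the example labels are all hypothetical. Only the 1,500/500 split and the percentage-correct score come from Sullivan's description.

```python
def split_scans(scan_ids, n_train=1500):
    """Split scan IDs into a labeled training set and a held-out problem set."""
    return scan_ids[:n_train], scan_ids[n_train:]


def accuracy(predictions, diagnoses):
    """Return the fraction of predictions that match the final diagnoses."""
    if not diagnoses or len(predictions) != len(diagnoses):
        raise ValueError("need two non-empty lists of the same length")
    correct = sum(p == d for p, d in zip(predictions, diagnoses))
    return correct / len(diagnoses)


# Hypothetical IDs standing in for the 2,000 anonymized CT scans.
scan_ids = [f"scan_{i:04d}" for i in range(2000)]
train_ids, problem_ids = split_scans(scan_ids)

# Made-up labels for illustration only: 1 = cancerous, 0 = not cancerous.
example_predictions = [1, 0, 1, 1]
example_diagnoses = [1, 0, 0, 1]
example_score = accuracy(example_predictions, example_diagnoses)
```

In this made-up example, three of the four predictions match the diagnoses, so `example_score` is 0.75.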

**Understanding lung cancer.**

The data was packaged on Kaggle's platform. Kaggle, acquired by Google in March, was founded by Goldbloom in 2010 specifically to host predictive modeling and analytics competitions. Companies and researchers post their data, allowing data scientists to compete to produce the best models. The company has hundreds of thousands of registered 'Kagglers' who span nearly 200 countries.

For this competition, the Kagglers were specialists in convolutional neural networks (CNNs), a type of deep learning neural network inspired by the visual mechanisms in living organisms. While useful for many different types of problems, CNNs excel at computer vision problems. In a previous Kaggle competition, Kagglers competed to create CNN-based algorithms that could differentiate pictures of dogs and cats on social media.

"This data was quite novel," Goldbloom says of the CT images provided by NCI. "It really pushed convolutional neural networks in a direction they haven't gone before. Medical data sets are always a challenge because of the size of the data sets. How many cat and dog images are there on the internet? Probably millions. But medical images are all extremely expensive to collect. Fewer people have CT scans than take pictures of their dogs and cats."

And CNNs, Goldbloom explains, are very prone to an effect called "overfitting," in which the statistical model tends to describe noise rather than the underlying relationship because there are too many parameters relative to the number of observations.
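Overfitting is easy to reproduce on a toy problem. The sketch below is not a CNN and uses no real scan data: every number is made up, and both models are hypothetical stand-ins. It compares a nearest-neighbor "memorizer," which effectively keeps one parameter per training example, with a model that learns a single cutoff. Two of the eight training labels are deliberately wrong to play the role of noise.

```python
def accuracy(model, data):
    """Fraction of (x, label) pairs that the model classifies correctly."""
    return sum(model(x) == label for x, label in data) / len(data)


def fit_memorizer(train):
    """1-nearest-neighbor: store every example and copy the closest label."""
    stored = list(train)

    def predict(x):
        _, nearest_label = min(stored, key=lambda pair: abs(pair[0] - x))
        return nearest_label

    return predict


def fit_threshold(train):
    """One-parameter model: predict 1 when x exceeds a single learned cutoff."""
    xs = sorted(x for x, _ in train)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = max(candidates, key=lambda t: accuracy(lambda x: int(x > t), train))
    return lambda x: int(x > best)


# Made-up data. True rule: label is 1 when x > 5.
# The labels at x=2.0 and x=8.0 are flipped on purpose to simulate noise.
train = [(1.0, 0), (2.0, 1), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1)]
# Held-out examples with clean labels.
test = [(1.4, 0), (1.8, 0), (2.3, 0), (4.5, 0),
        (5.5, 1), (7.7, 1), (8.2, 1), (8.6, 1)]

memorizer = fit_memorizer(train)
threshold = fit_threshold(train)

memorizer_train = accuracy(memorizer, train)
memorizer_test = accuracy(memorizer, test)
threshold_train = accuracy(threshold, train)
threshold_test = accuracy(threshold, test)
```

On this toy data the memorizer is right on 100 percent of the training examples but only 50 percent of the held-out ones, because it has reproduced the two noisy labels. The single-cutoff model scores 75 percent in training and 100 percent on the held-out set.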

"Building a convolutional neural network that doesn't overfit is difficult, and gets more difficult the smaller the data set," Goldbloom says. "That's really where the skill comes in. It must generalize well on a relatively small number of images."

Nearly 10,000 Kagglers participated in the Data Science Bowl. Collectively, they spent more than 150,000 hours and submitted nearly 18,000 algorithms. A number of radiologists volunteered their expertise on Kaggle's forums to help the competitors refine their efforts.

Data Science Bowl winners

In the end, the first-place winners were Liao Fangzhou and Zhe Li, two researchers from China's Tsinghua University. Julian de Wit and Daniel Hammack, software and machine learning engineers in the Netherlands, took second place. Team Aidence, composed of members who work for a Netherlands-based company that applies deep learning to medical image interpretation, took third place.

"The NIH [National Institutes of Health] are going to end up working with the [U.S.] Food and Drug Administration and hopefully pipeline these analytics so they can go into the software that's actually reading these CT scans," Sullivan says. "That's the huge benefit we're trying to drive for."

He expects the NIH and FDA to look at a number of the top-placing algorithms. The top teams all scored within a fraction of a percent of each other, and some may be more production-ready or better able to scale.