Pneumonia Detection from Chest X-Rays

Running the scripts:

Execution order:

eda.ipynb
model.ipynb
inference.ipynb

Overview

A CNN is trained on the NIH Chest X-ray Dataset to classify a given chest X-ray for the presence or absence of pneumonia. The goal is to create an ML system that predicts the presence of pneumonia with human radiologist-level accuracy and that could be prepared for submission to the FDA for 510(k) clearance as software as a medical device.

The dataset contains about 112,000 chest X-ray images acquired from about 30,000 patients. The disease labels for each image were extracted from radiology reports.

Pneumonia and X-Rays in the Wild

Chest X-ray exams are one of the most frequent and cost-effective types of medical imaging examinations. Deriving clinical diagnoses from chest X-rays can be challenging, however, even for skilled radiologists.

When it comes to pneumonia, chest X-rays are the best available method for diagnosis. More than 1 million adults are hospitalized with pneumonia and around 50,000 die from the disease every year in the US alone. The high prevalence of pneumonia makes it a good candidate for the development of a deep learning application for two reasons: 1) data are available in sufficient quantity to train deep learning models for image classification, and 2) there is an opportunity for clinical aid by providing higher-accuracy image reads of a difficult-to-diagnose disease and/or reducing clinical burnout by performing automated reads of very common scans.

The diagnosis of pneumonia from chest X-rays is difficult for several reasons:

The appearance of pneumonia in a chest X-ray can be very vague depending on the stage of the infection
Pneumonia often overlaps with other diagnoses
Pneumonia can mimic benign abnormalities

For these reasons, common methods of diagnostic validation in the clinical setting are obtaining sputum cultures to test for the presence of bacteria or viral bodies that cause pneumonia, reading the patient's clinical history and taking their demographic profile into account, and comparing the current image to prior chest X-rays of the same patient if they are available.

About the Dataset

The NIH Chest X-ray dataset was curated by the NIH specifically to address the problem of a lack of large x-ray datasets with ground truth labels to be used in the creation of disease detection algorithms.

The data can be downloaded from the Kaggle website.

There are 112,120 X-ray images with disease labels from 30,805 unique patients in this dataset. The disease labels were created using Natural Language Processing (NLP) to mine the associated radiological reports. The labels include 14 common thoracic pathologies:

Atelectasis
Consolidation
Infiltration
Pneumothorax
Edema
Emphysema
Fibrosis
Effusion
Pneumonia
Pleural thickening
Cardiomegaly
Nodule
Mass
Hernia

The biggest limitation of this dataset is that the image labels were NLP-extracted, so some labels may be erroneous. The NLP labeling accuracy is estimated to be >90%.

The original radiology reports are not publicly available but you can find more details on the labeling process here.

Dataset Contents:

112,120 frontal-view chest X-ray PNG images in 1024 x 1024 resolution
Metadata for all images (Data_Entry_2017.csv): Image Index, Finding Labels, Follow-up #, Patient ID, Patient Age, Patient Gender, View Position, Original Image Size and Original Image Pixel Spacing.

Results in form of a FDA Submission

Name of Device: PneumoniaXNet

Algorithm Description

General Information

Intended Use Statement

Assisting radiologists in the detection of pneumonia on chest X-ray images. The AI-based algorithm PneumoniaXNet is explicitly intended to be used under the supervision of an expert such as a radiologist or another clinician with expert knowledge.

Indications for Use

Automated detection of pneumonia from chest X-rays in non-emergency clinical settings. The device PneumoniaXNet is intended for use as a diagnostic tool if the following conditions are fulfilled:

Predictions on male and female patients (see figure 55 for data distribution) between the ages of about 10 to 90 years (see figure 54 for data distribution).
Chest X-ray view positions: posteroanterior (PA) view and/or erect anteroposterior (AP) chest view (see figure 56 for data distribution).
Optionally (on rare occasions): Analysing chest X-rays in regions with inadequate access to diagnostic imaging specialists. Getting a rough idea of the disease of a patient is better than nothing.

Description of a possible clinical setting:

Obtaining an X-ray image of the patient’s chest (PA or AP view).
Sending the scan in DICOM format to a remote server on which the PneumoniaXNet software is installed for processing.
Checking the DICOM file for compatibility (see 1.2.1).
Preprocessing the scan if it passes the compatibility check (see 1.2.2).
Feeding the X-ray image into the machine learning algorithm PneumoniaXNet.
Outputting one of two diagnostic predictions from the PneumoniaXNet classifier: patient has pneumonia (1) or patient has no pneumonia (0).
Sending the result to a radiologist, who validates it and compiles a final diagnosis.

Device Limitations

The PneumoniaXNet device does not achieve 100% accuracy. It is therefore advised that this classifier's predictions are only used as a supplementary diagnosis tool. The final diagnosis should always be compiled by an expert such as a radiologist.
PneumoniaXNet should be run on a CUDA-capable GPU (e.g. a local server or a cloud server with GPU access). This is especially important if the algorithm is used in situations where getting a result quickly is important.

Clinical Impact of Performance

Enhancing the workflow by providing fast and reliable pneumonia detection from X-ray images.
If the algorithm predicts a positive pneumonia case, a radiologist can prioritize the analysis of this case, and the patient could be treated sooner.
It is strongly recommended that the device is used as an assisting device for an imaging specialist. If the algorithm predicts a false positive (FP), the radiologist can always intervene and spare the patient an unnecessary pneumonia treatment. The only downside of a false positive should be some wasted time. If the algorithm predicts a false negative (FN), this could delay the patient’s treatment. A trained radiologist should detect the pneumonia in this serious case. False negatives are much more severe than false positives. Therefore the PneumoniaXNet device tries to prevent false negatives by adjusting the classifier threshold (see section 1.3.2).

Algorithm Design and Function

DICOM Checking Steps

Before a DICOM file is preprocessed and fed to the PneumoniaXNet algorithm, its header is checked for the required properties. The following things are checked:

Modality must be ’DX’.
Body part must be ’chest’.
Patient position must be ’PA’ or ’AP’.
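
The three checks above can be sketched as a small function. This is an illustrative sketch, not the project's implementation: it assumes the header is exposed as an object with the standard DICOM keywords Modality, BodyPartExamined and PatientPosition as attributes, and the function name check_dicom is hypothetical. A SimpleNamespace stands in for a real DICOM file here.

```python
from types import SimpleNamespace

ALLOWED_POSITIONS = {"PA", "AP"}


def check_dicom(header):
    """Return the reasons a header fails the checks (empty list if it passes)."""
    problems = []
    modality = getattr(header, "Modality", None)
    body_part = getattr(header, "BodyPartExamined", None)
    position = getattr(header, "PatientPosition", None)

    if modality != "DX":
        problems.append(f"unsupported modality: {modality!r}")
    # Assumption: compare case-insensitively, since headers often store 'CHEST'.
    if str(body_part).strip().upper() != "CHEST":
        problems.append(f"unsupported body part: {body_part!r}")
    if position not in ALLOWED_POSITIONS:
        problems.append(f"unsupported patient position: {position!r}")
    return problems


# Stand-ins for real DICOM headers, invented for illustration.
good = SimpleNamespace(Modality="DX", BodyPartExamined="CHEST", PatientPosition="PA")
bad = SimpleNamespace(Modality="CT", BodyPartExamined="CHEST", PatientPosition="PA")

print(check_dicom(good))
print(check_dicom(bad))
```

Returning the list of reasons rather than a bare boolean makes it easier to report why a scan was rejected.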

Preprocessing Steps

The following preprocessing steps are performed on each image before it is fed into the PneumoniaXNet algorithm:

Resizing the image to a spatial size of 224x224 pixels
Converting the image from grayscale to RGB with dimensions 1x224x224x3 (the first dimension is the batch size; multiple images can be fed into the algorithm simultaneously in batches)
Applying the DenseNet121-specific preprocessing: input pixel values are scaled to the range 0 to 1 and each channel is normalized with respect to the ImageNet dataset
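
The arithmetic of these steps can be illustrated in plain Python. This is a sketch under stated assumptions: the text does not name the interpolation method, so nearest-neighbour sampling is used as a stand-in, the mean and standard deviation are the commonly used ImageNet channel statistics, and the function names are hypothetical. In practice an array library and the preprocessing function shipped with the Keras DenseNet model would likely be used instead.

```python
# Commonly used ImageNet channel statistics (RGB order), applied after
# scaling pixel values to the range 0..1.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_nearest(image, size):
    """Resize a 2-D list of pixels to size x size (nearest-neighbour sampling)."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


def preprocess(image, size=224):
    """Turn a grayscale image (values 0..255) into a 1 x size x size x 3 batch."""
    resized = resize_nearest(image, size)
    return [[
        [
            # Copy the gray value into three channels, then normalize each one.
            [(pixel / 255.0 - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]
            for pixel in row
        ]
        for row in resized
    ]]


# Tiny 2x2 stand-in for a chest X-ray, invented for illustration.
batch = preprocess([[0, 255], [255, 0]])
print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
```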

CNN Architecture

We use the Keras implementation of the DenseNet121 model as the base model. We cut the last few (dense) layers of the net and replace them with the following additional layers:

Global average pooling layer
Dense layer with 1024 neurons and ReLU activation with dropout 0.5
Dense layer with 512 neurons and ReLU activation with dropout 0.5
Dense layer with 256 neurons and ReLU activation
Dense layer with 1 neuron and sigmoid activation (output layer)

We freeze all trainable parameters of the base net except for the parameters of the last 7 layers. These parameters, together with the parameters of the additional layers above, make up the trainable parameters. In total we have 1,872,129 trainable parameters. This so-called transfer learning is a common and efficient way of training convolutional neural networks. The imported DenseNet121 model has previously been trained on the ImageNet dataset.
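
As a sanity check on the reported parameter count, the size of the added layers can be computed by hand. The sketch below assumes that the global average pooling layer outputs 1024 features (the channel count of DenseNet121's final feature map) and that pooling and dropout contribute no parameters. The remainder is then what the unfrozen base layers would have to account for.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


FEATURES = 1024  # assumed output size of the global average pooling layer
layer_sizes = [FEATURES, 1024, 512, 256, 1]

# Sum over the four dense layers of the new classification head.
head_params = sum(
    dense_params(n_in, n_out) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)

TOTAL_TRAINABLE = 1_872_129  # figure reported in the text
unfrozen_base_params = TOTAL_TRAINABLE - head_params

print(head_params)           # 1705985
print(unfrozen_base_params)  # 166144
```

Under these assumptions, roughly 91% of the trainable parameters sit in the new dense layers rather than in the unfrozen part of the base network.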

Algorithm Training

Parameters used for training:

Batch size: 16 images
Optimizer: Adam with an initial learning rate of 0.001
Image augmentation used during training:
- horizontal flip: useful because we have X-ray images from both viewing positions PA and AP. The algorithm will be able to predict pneumonia presence from both viewing positions.
- rotation range 10 degrees
- height shift range 0.1
- width shift range 0.1
- shear range 0.1
- zoom range 0.15

The model performance and training progress can be seen in figures 46, 47, and 48.
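
The training settings listed above can be collected in one place as a configuration fragment. The key names below are an assumption: they mirror the keyword arguments of the Keras ImageDataGenerator class, which the wording of the list suggests, but the text does not confirm which augmentation tool was used.

```python
# Hypothetical configuration dictionaries; the values are taken from the list above.
TRAINING_CONFIG = {
    "batch_size": 16,
    "optimizer": "adam",
    "initial_learning_rate": 0.001,
}

AUGMENTATION_CONFIG = {
    "horizontal_flip": True,
    "rotation_range": 10,        # degrees
    "height_shift_range": 0.1,   # fraction of image height
    "width_shift_range": 0.1,    # fraction of image width
    "shear_range": 0.1,
    "zoom_range": 0.15,
}

for name, value in AUGMENTATION_CONFIG.items():
    print(f"{name}: {value}")
```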

Final Threshold and Explanation

Our goal is to minimize the false negative (FN) predictions. The F1 score, F1 = 2TP / (2TP + FP + FN), decreases as the number of false negatives grows, so we choose the threshold that maximizes the F1 score. Furthermore, the F1 metric enables us to compare our results with previous work (see 1.6.3).

To maximize the F1 score we chose the prediction threshold 0.216 (see figure 51).
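
A threshold search of this kind can be sketched as follows. The labels and scores are toy values invented for illustration, the function names are hypothetical, and the sketch does not reproduce the reported threshold of 0.216. It only shows the selection procedure.

```python
def f1_score(labels, scores, threshold):
    """F1 = 2TP / (2TP + FP + FN), predicting positive when score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(labels, scores, candidates):
    """Return the candidate with the highest F1 (the first one wins ties)."""
    return max(candidates, key=lambda t: f1_score(labels, scores, t))


# Toy validation data, invented for illustration only.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.90, 0.40, 0.30, 0.20, 0.10, 0.25, 0.35, 0.05]
candidates = [i / 100 for i in range(1, 100)]

threshold = best_threshold(labels, scores, candidates)
print(threshold, f1_score(labels, scores, threshold))
```

Note that the F1-optimal threshold can fall well below 0.5 when positives are rare or receive low scores, which is consistent with the low threshold chosen here.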

Databases

We train the PneumoniaXNet algorithm on the National Institutes of Health Chest X-Ray Dataset. This dataset comprises 112,120 X-ray images with disease labels from 30,805 unique patients. It was not designed specifically for detecting pneumonia and also contains other diseases. Patients might have multiple diseases simultaneously (see figures 57 and 58). Altogether, it contains 15 classes: 14 diseases and one class for ’no findings’ (none of the 14 diseases in this dataset). The patients’ age, gender and viewing position of the X-ray images are depicted in figures 54, 55 and 56. The prevalence of the most common diseases in the dataset is visualized in figure 57. When randomly splitting the dataset into training, validation and testing sets, it is ensured that the ratios in all demographics are roughly maintained.

We preprocess the NIH chest X-ray dataset before we split it. For this we convert all patient ages to the unit year and delete all patients with age > 100 from the dataset (this is the case for 16 data points). The adjusted dataset contains 1430 images with pneumonia disease labels which corresponds to ~1.3% of all data points. We can say that according to the

Description of Training Dataset

80% of the patients in the NIH chest X-ray dataset are assigned to the (raw) training set. We split the dataset by patient to ensure that a patient can only be in one dataset (training, validation or testing set). The training set contains many more negative samples (no pneumonia) than positive samples (has pneumonia). We balance the training set by randomly choosing negative samples until we have the same number of positive and negative samples. The remaining negative samples in this dataset are discarded.

The final cardinality of the training dataset is 2302.
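
Splitting by patient rather than by image can be sketched as below. The record layout (a patient_id field per image) and the function name are hypothetical, and the sketch does not attempt to preserve demographic ratios across the splits, which the text says the actual split roughly maintains.

```python
import random


def split_by_patient(records, seed=0):
    """Assign 80/10/10 of the patients (not images) to train/val/test."""
    patients = sorted({r["patient_id"] for r in records})
    random.Random(seed).shuffle(patients)
    n_train = len(patients) * 8 // 10
    n_val = len(patients) // 10
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    # Every image follows its patient, so no patient appears in two sets.
    return {
        name: [r for r in records if r["patient_id"] in ids]
        for name, ids in groups.items()
    }


# Toy data: 100 patients with 3 images each, invented for illustration.
records = [
    {"patient_id": p, "image": f"{p:03d}_{i}.png"}
    for p in range(100)
    for i in range(3)
]
splits = split_by_patient(records)
print({name: len(rows) for name, rows in splits.items()})
```

Shuffling patient IDs instead of image rows is what prevents images of the same patient from leaking between the training and evaluation sets.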

Description of Validation Dataset

10% of all patients are assigned to the validation dataset. This set is used for picking the best performing model (weights) during training. In a clinical setting we assume a higher incidence of X-ray images with pneumonia than in a dataset which maps a ’complete population’. Therefore we do not sample data points from the validation set until we reach 1.3% of pneumonia cases in the dataset. Instead, we randomly sample data points so that we get a ratio of 20%/80% of positive and negative samples.

The final cardinality of the validation dataset is 745.

Description of Testing Dataset

10% of the patients are used for testing the performance of the PneumoniaXNet algorithm. This dataset is used for the final evaluation of the model performance. It has the same ratio of positive to negative samples as the validation dataset: 20%/80%.

The final cardinality of the testing dataset is 640.
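
The class ratios described for the three datasets (50%/50% for training, 20%/80% for validation and testing) can be obtained by keeping all positive samples and drawing a fixed number of negatives per positive. The following is a minimal sketch with hypothetical field and function names, not the project's code.

```python
import random


def sample_to_ratio(records, negatives_per_positive, seed=0):
    """Keep all positives and draw a fixed number of random negatives per positive."""
    positives = [r for r in records if r["pneumonia"] == 1]
    negatives = [r for r in records if r["pneumonia"] == 0]
    n_negatives = min(len(positives) * negatives_per_positive, len(negatives))
    return positives + random.Random(seed).sample(negatives, n_negatives)


# Toy data: 10 positive and 990 negative images, invented for illustration.
records = [{"image": i, "pneumonia": int(i < 10)} for i in range(1000)]

balanced = sample_to_ratio(records, negatives_per_positive=1)  # training: 50%/50%
clinical = sample_to_ratio(records, negatives_per_positive=4)  # val/test: 20%/80%
print(len(balanced), len(clinical))
```

If the 20%/80% ratio holds exactly, the reported cardinalities of 745 and 640 would correspond to 149 and 128 positive images, and the balanced training set of 2302 to 1151 positive images.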

Ground Truth

The ground truth for the data used was created by the NIH. They extracted the labels with the help of an NLP algorithm running over radiology reports which are not publicly available. This process is prone to some erroneous labels because the NLP algorithm might misinterpret complex sentence structures. The NIH reports an NLP labeling accuracy of >90%. The NIH states that they had to deal with uncertainties in the radiology reports (see also Kaggle data source). Often they classified such uncertain cases as ’no finding’. The ’no finding’ label can also contain diseases which are not considered in this dataset. This means that a scan labeled ’no finding’ might still show some disease instead of being a scan of a healthy subject. All these limitations in the data labels translate directly to the resulting algorithm which was trained on this data.

On the other hand, the benefit of this method is that huge datasets can be labeled in a very fast and cost-efficient way.

FDA Validation Plan

Patient Population Description for FDA Validation Dataset

In this section we consider an ideal dataset that might be constructed by a clinical partner for the FDA validation dataset. The demographics of the FDA validation dataset should be similar to the NIH chest X-ray dataset on which the PneumoniaXNet algorithm was trained. This means:

Age range: 10 to 90 years
Sex: male and female
Type of imaging modality: DX (digital radiography)
Body part imaged: chest
Prevalence of disease of interest: 20%, so that it matches the validation set used to evaluate the PneumoniaXNet algorithm

Ground Truth Acquisition Methodology

The most affordable and reliable method of acquiring ground truth labels is to get multiple experts, e.g. radiologists to label the images for presence of pneumonia. A majority vote for each image would reveal the ground truth. This is the silver standard approach.
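
The majority vote can be sketched as follows. The readings are invented for illustration, the function name is hypothetical, and returning None for a tie (to flag the image for adjudication) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(readings):
    """Return the most common label, or None when there is no clear majority."""
    counts = Counter(readings).most_common()
    if not counts:
        return None  # no readings available
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: needs adjudication, e.g. by an additional reader
    return counts[0][0]


# Hypothetical readings per image (1 = pneumonia, 0 = no pneumonia).
readings_per_image = {
    "image_a": [1, 1, 0],
    "image_b": [0, 0, 0],
    "image_c": [1, 0],
}
ground_truth = {name: majority_vote(r) for name, r in readings_per_image.items()}
print(ground_truth)
```

Using an odd number of readers avoids ties for a binary label.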

The gold standard approach would be to take pathological samples of the tissue. This process is very time-consuming and expensive. If this method is available, even for a sub-sample of the available data, it will be valuable for evaluating the performance of the algorithm.

Algorithm Performance Standard

In a previous study by Rajpurkar et al., the authors trained an algorithm for detecting pneumonia. They measured the performance of their model by comparing the F1 score of the model predictions with the averaged F1 score of four expert radiologists’ predictions. To get comparable evaluation results, we use the F1 score as the performance metric. The F1 score is the harmonic mean of recall and precision. The four expert radiologists achieved an averaged F1 score of 0.387. We use this “radiologist-level value” as the standard to beat.

With an F1 score of 0.41, the PneumoniaXNet algorithm performs at least as well as expert radiologists.

This project is part of the Udacity Nanodegree program "AI in Healthcare" (November 2020).
