This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

Machine learning applications in the health care domain can have a great impact on people’s lives. At the same time, medical datasets are usually large, requiring significant computational resources. Although this might not be a problem for the wide adoption of machine learning tools in high-income countries, the availability of computational resources can be limited in low-income countries and on mobile devices. This can prevent many people from benefiting from advances in machine learning applications in the field of health care.

In this study, we explore three methods to increase the computational efficiency and reduce model sizes of either recurrent neural networks (RNNs) or feedforward deep neural networks (DNNs) without compromising their accuracy.

We used in-hospital mortality prediction on an intensive care unit dataset as our case study. We reduced the sizes of the RNN and DNN by pruning “unused” neurons. Additionally, we modified the RNN structure by adding a hidden layer to the RNN cell while reducing the total number of recurrent layers, thereby reducing the total number of parameters in the network. Finally, we applied quantization to the DNN by storing the weights in 8 bits instead of 32 bits.

We found that all methods increased implementation efficiency, including training speed, memory size, and inference speed, without reducing the accuracy of mortality prediction.

Our findings suggest that neural network condensation allows for the implementation of sophisticated neural network algorithms on devices with lower computational resources.

Machine learning applications for health care can have a great impact on people’s lives. Currently, the possibilities for machine learning in health care include diagnostic systems, biochemical analysis, image analysis, and drug development. One of the most significant challenges in using machine learning for health care applications is that datasets are usually huge and sparse, requiring substantial computational resources, especially for overparameterized deep neural networks (DNNs). Consequently, the availability of computational resources can limit the widespread use of such tools, for example, for people who live in low-income countries and for those who want to run diagnostic apps on their own mobile devices.

In this study, we set in-hospital mortality prediction as a case study to explore various ways of improving the efficiency (ie, training speed, memory size, and inference speed) of neural network–based algorithms. Mortality prediction is a well-established medical machine learning application wherein the mortality of a patient after transfer to the intensive care unit (ICU) is predicted based on their vital signs, laboratory tests, demographics, and other factors. Mortality prediction is important in clinical settings because such a prediction can help identify a patient’s declining state and the need for intervention. We built baseline models with either recurrent neural network (RNN) or dense neural network architectures, based on which we explored efficiency improvements via neural network condensation without sacrificing prediction accuracy. An RNN is a class of artificial neural networks wherein connections between nodes form a directed graph along a temporal sequence, processing input sequences in a recurrent manner. RNNs are widely used in clinical informatics in tasks such as temporal data analysis and clinical natural language processing.

Reduction of complexity and improvement of efficiency of artificial neural networks is an active field of research, wherein a wide range of methods have been explored. One representative example is neural network pruning, wherein a fraction of weights is removed from the trained model and the “lottery ticket” is found when the remaining weights can still be trained quickly with competitive loss and accuracy [

We used the Medical Information Mart for Intensive Care-III (MIMIC-III) critical care database for the implementation of our models [

Summary of patient data (N=33,798).

| Variable | Value |
| Mortality during ICU^{a} stay, n (%) | 3717 (10.9) |
| Age in years, median (SD) | 65.8 (11.3) |
| Male participants, n (%) | 18,893 (55.9) |

^{a}ICU: intensive care unit.

Data were collected from the MIMIC-III database. Only data from the first 48 hours were used as inputs in our analysis. For the purpose of this study, 76 features were selected for analysis (see examples listed in

pH, fraction of inspired oxygen, systolic blood pressure, diastolic blood pressure, mean blood pressure, height, weight, oxygen saturation, glucose, temperature, capillary refill rate, respiratory rate, heart rate, and Glasgow Coma Scale score

Classification accuracy of all models was measured using the area under the receiver operating characteristic curve (AUROC) on the test set. Model sizes were measured by the number of parameters and the size of the saved model file. Inference speed was calculated based on the time taken to make predictions on the test data and was normalized per patient. We used Python 3.6 and Keras 2.2.4, with TensorFlow 1.1.2 as the backend, for the analysis.
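The AUROC reported throughout can be computed with standard libraries; as a reference for how the metric behaves, the following is a minimal rank-based implementation in NumPy (the function name and toy data are ours, not from the study):

```python
import numpy as np

def auroc(y_true, y_score):
    """Rank-based AUROC: the probability that a randomly chosen
    positive is scored higher than a randomly chosen negative
    (ties count as 0.5)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Pairwise comparisons; fine for small illustrative examples.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 3 of 4 positive-negative pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```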

Our baseline RNN model consists of a masking layer, two LSTM layers, a dropout layer, and a dense output layer, as shown in

Architecture of recurrent neural network baseline model. DNN: deep neural network; LSTM: long short-term memory; ReLU: rectified linear unit.

Besides pruning the RNN, we also tried another approach: inserting an additional hidden dense layer into the inner gates of the LSTM, which we call hLSTM, to improve the representational “power” of the LSTM. For a traditional LSTM, the inner structure is as follows:

f_t = σ(W_f * x_t + U_f * h_{t−1} + b_f)
i_t = σ(W_i * x_t + U_i * h_{t−1} + b_i)
o_t = σ(W_o * x_t + U_o * h_{t−1} + b_o)
c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh(W_c * x_t + U_c * h_{t−1} + b_c)
h_t = o_t ⊗ tanh(c_t)

where * is the matrix product; ⊗ is the element-wise product; σ is the sigmoid function; x_t, h_t, and c_t are the input, hidden state, and cell state at time t; and W, U, and b are the input kernels, recurrent kernels, and biases of the respective gates.
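To make the hLSTM modification concrete, the following NumPy sketch shows one step of an LSTM cell with an extra “hidden kernel” inserted between the input kernel and the recurrent kernel. This reflects our reading of the description above; the exact placement of the extra layer, the ReLU activation, and all names and dimensions are illustrative assumptions, not the study’s implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hlstm_step(x, h_prev, c_prev, Wx, Wh, U, b):
    """One step of a hidden-layer LSTM (hLSTM) cell.

    Assumed structure: the input projection x @ Wx passes through
    an extra ReLU layer (the "hidden kernel" Wh) before being
    combined with the recurrent kernel U. Shapes: x (d,),
    h_prev and c_prev (n,), Wx (d, k), Wh (k, 4n), U (n, 4n), b (4n,).
    """
    hidden = np.maximum(0.0, x @ Wx)      # hidden kernel with ReLU
    z = hidden @ Wh + h_prev @ U + b      # all four gates at once
    n = h_prev.shape[0]
    i, f, o, g = z[:n], z[n:2*n], z[2*n:3*n], z[3*n:]
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Toy dimensions only, to show the shapes work out.
rng = np.random.default_rng(0)
d, k, n = 4, 3, 2
h, c = hlstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n),
                  rng.normal(size=(d, k)), rng.normal(size=(k, 4*n)),
                  rng.normal(size=(n, 4*n)), np.zeros(4*n))
print(h.shape, c.shape)  # (2,) (2,)
```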

Our baseline feedforward artificial neural network (commonly called a DNN) used in this project consists of three fully connected layers, a dropout layer, and an output layer. The fully connected layers have 256, 128, and 64 neurons, respectively, and they use the rectified linear unit (ReLU) as the activation function. The dropout layer has a dropout rate of 0.5. The sigmoid function was used as the activation at the output layer. The loss function used was binary cross-entropy, and the optimization algorithm used was Adam. The baseline DNN model and the pruned DNN model (pDNN) were both trained for 20 epochs, using a batch size of 8. The input into the DNN model has the same feature set as the LSTM model but does not consider time series information; the values were calculated by averaging nonmissing values across time steps.
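The architecture described above fixes the parameter count of the baseline DNN. As a quick sanity check (assuming the 76 input features described earlier, and noting that dropout adds no parameters):

```python
# Parameter count of the baseline DNN described above:
# 76 inputs -> 256 -> 128 -> 64 -> 1, dense layers with biases.
sizes = [76, 256, 128, 64, 1]
params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(params)  # 60929
```

This matches the 60,929 parameters reported for the baseline DNN in the results.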

All neural network pruning was conducted at the channel level, which means that a neuron and all its inputs and outputs were removed from the model if the neuron was pruned. The keras-surgeon library in Python was used for pruning. In each layer, neurons were pruned if their mean weight across all inputs from the previous layer was below the set quantile (ie, 25% or 50% in this study). The original model was trained for 1 epoch before pruning and was trained for another 19 epochs after pruning.
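The channel-level criterion above can be sketched in a few lines of NumPy. This is an illustration of the selection rule, not the keras-surgeon implementation; using the mean *absolute* weight as the saliency score, and all names, are our assumptions:

```python
import numpy as np

def channels_to_prune(W, q=0.5):
    """Indices of output channels (neurons) whose mean absolute
    incoming weight falls below the q-quantile of the layer.

    W has shape (n_inputs, n_neurons), one column per neuron.
    """
    saliency = np.abs(W).mean(axis=0)      # one score per neuron
    threshold = np.quantile(saliency, q)
    return np.where(saliency < threshold)[0]

def prune_layer(W, b, idx):
    """Remove pruned neurons: drop the matching columns of W and
    entries of b (the next layer must drop the matching rows)."""
    keep = np.setdiff1d(np.arange(W.shape[1]), idx)
    return W[:, keep], b[keep], keep

rng = np.random.default_rng(1)
W, b = rng.normal(size=(8, 4)), np.zeros(4)
idx = channels_to_prune(W, q=0.5)
W2, b2, keep = prune_layer(W, b, idx)
print(W2.shape)  # (8, 2): half of the 4 neurons remain
```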

Quantization was applied to the DNN model after training. Parameters, including weights and activations, originally stored in a 32-bit floating-point format were converted to 8 bits using TensorFlow Lite. A uniform quantization strategy was used, as previously described [. Given the range of the original floating-point values (F_{min}, F_{max}), all floating-point values were quantized uniformly into the range (0, 255) as 8-bit integers, where F_{min} corresponds to 0 and F_{max} corresponds to 255.

The quantization process is

F_q = round((F − F_{min}) / S), with scale S = (F_{max} − F_{min}) / 255

where F_q is the quantized 8-bit value, F is the original floating-point value, and S is the quantization scale.
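A minimal NumPy sketch of this uniform 8-bit scheme follows (the study used TensorFlow Lite; the function names here are ours). Note that the worst-case reconstruction error of uniform quantization is half a quantization step:

```python
import numpy as np

def quantize_uniform(w):
    """Map float weights uniformly onto the integers 0..255."""
    f_min, f_max = w.min(), w.max()
    scale = (f_max - f_min) / 255.0
    q = np.round((w - f_min) / scale).astype(np.uint8)
    return q, f_min, scale

def dequantize(q, f_min, scale):
    """Recover approximate float weights from 8-bit codes."""
    return q.astype(np.float32) * scale + f_min

rng = np.random.default_rng(2)
w = rng.normal(size=1000).astype(np.float32)
q, f_min, scale = quantize_uniform(w)
w_hat = dequantize(q, f_min, scale)
# Worst-case rounding error is half a quantization step.
print(np.abs(w - w_hat).max() <= 0.5 * scale + 1e-4)  # True
```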

Recurrent artificial neural networks (or simply, RNNs) are a group of machine learning models widely used in clinical settings that take sequential or time series information as the input. However, training RNNs and running inference from them are relatively computationally intensive. In order to enable machine learning algorithms to be used on devices with limited computational power, such as those available in low-income countries and on mobile devices, we used three strategies to reduce the storage size of the model and to increase the speed of training and inference (

We built a baseline RNN using two layers of LSTM neurons to predict ICU mortality rates using MIMIC-III dataset [

Neural network condensation methods. (A) Hidden-layer long short-term memory (LSTM). Instead of a single fixed-layer nonlinearity for gate control of the LSTM, a multiple-layer neural network with ReLU activation was used to enhance the gate controls. In this way, fewer LSTM layers were needed to build a model with similar performance. (B) A large portion of the parameters in artificial neural networks are redundant. We pruned 50% of the channels (neurons) with the lowest weights in each layer to reduce the size and complexity of the neural network. (C) Most artificial neural network implementations in research settings use 32- or 64-bit floating points for model parameters. We quantized the parameters to 8 bits after training to reduce the sizes of the models. DNN: deep neural network.

Recurrent neural network condensation.

| Model | Parameters, n | File size (kb) | Inference (seconds per sample) | Training time (seconds; 20 epochs) | Test AUROC^{a} (last epoch) |
| Baseline LSTM^{b} | | | | | |
| Pruned LSTM | 3273 | 73 | 318 | 4990 | 0.853 |
| Hidden-layer LSTM | 6993 | 111 | 254 | 3000 | 0.860 |

^{a}AUROC: area under the receiver operating curve.

^{b}LSTM: long short-term memory.

The first strategy was to modify the LSTM cell to increase the representational power of each layer. We modified the original neural network structure by adding a hidden layer into the original LSTM class, wherein one additional layer called the “hidden kernel” was inserted between the input kernel and the recurrent kernel (see equation 4). Using this strategy, we replaced the original 2-layer LSTM with a single layer of hLSTM, simplifying the overall structure while embedding a similar quantity of information in this single “condensed” layer.

Both the baseline model and the hLSTM model with only one layer of hLSTM were trained under the same settings. The comparison of AUROC and accuracy is shown in

Accuracy, model size, and inference speed of recurrent and feedforward neural networks after different types of condensation. (A) Area under the receiver operating characteristic curve (AUROC) of various models. (B) Various model sizes in memory. (C) Inference speed of various models. Models included the recurrent neural network (RNN) baseline model with two layers of long short-term memory (LSTM), pruned LSTM (pLSTM) model, and LSTM with one inserted hidden layer (hLSTM); deep neural network (DNN) baseline model; pruned DNN (pDNN) model; and quantized DNN (qDNN) model.

Another method to condense RNN models is pruning, in which nonessential neurons of the RNN model are removed to minimize model size and increase speed. About 50% of the LSTM neurons with the lowest weights in each hidden layer were pruned after the first epoch of training. The pruned LSTM has only half the number of parameters of the original LSTM, but it achieves a similar level of accuracy, yielding an AUROC of 0.85 (

Test area under the receiver operating curve (AUROC) by training epoch for recurrent neural network (RNN) models. Evolution of different RNN models over training epochs on test data. The percentage next to the pruned long short-term memory (pLSTM) model indicates the pruned percentile. hLSTM: hidden-layer long short-term memory; LSTM: long short-term memory.

A feedforward neural network, commonly called a DNN when it has multiple hidden layers, is another widely used form of machine learning in clinical settings. We trained a DNN with 3 hidden layers, consisting of 256, 128, and 64 neurons, respectively, for ICU mortality prediction. The baseline DNN achieved an AUROC of 0.82, using patient data collected within the first 48 hours after admission. We explored two methods to condense the size of the DNN. The first method used the same pruning strategy as for the RNN: 50% of the channels were pruned after the first epoch of training. The prediction accuracy of the pDNN remained at the same level as that of the original DNN, and the inference speed doubled (

Feedforward neural network condensation.

| Model | Parameters, n | File size (kb) | Inference (seconds per sample) | Training time (seconds; 20 epochs) | Test AUROC^{a} (last epoch) |
| Baseline DNN^{b} | 60,929 | 767 | 20 | 3300 | 0.82 |
| Pruned DNN | 27,312 | 315 | 10 | 3310 | 0.81 |
| Quantized DNN | 60,929 | 64 | 15 | N/A^{c} | 0.82 |

^{a}AUROC: area under the receiver operating curve.

^{b}DNN: deep neural network.

^{c}N/A: not applicable.

In this study, we were able to use data from the MIMIC-III database [

area under the receiver operating curve

deep neural network

hidden-layer long short-term memory

intensive care unit

long short-term memory

Medical Information Mart for Intensive Care-III

pruned deep neural network

quantized deep neural network

rectified linear unit

recurrent neural network

We thank the authors of the published article “Multitask Learning and Benchmarking with Clinical Time Series Data” for providing the basic code for data preprocessing from Medical Information Mart for Intensive Care-III via GitHub. We thank Professor David Sontag (Massachusetts Institute of Technology, Cambridge, MA) for their helpful discussion.

None declared.
