
Commit 3d4d2dc

Merge pull request #55 from UCL-CORU/rewrite-notebooks
Rewrite notebooks
2 parents 29c9eee + d81206b commit 3d4d2dc

10 files changed

Lines changed: 340 additions & 383 deletions

README.md

Lines changed: 36 additions & 14 deletions
@@ -29,30 +29,52 @@
 [pypi-version]: https://img.shields.io/pypi/v/patientflow -->
 <!-- prettier-ignore-end -->
 
-Welcome to the PatientFlow repository, which provides predictive modelling for hospital bed management. I'm [Zella King](https://github.com/zmek/), a health data scientist in the Clinical Operational Research Unit (CORU) at University College London. Since 2020, I have worked with University College London Hospital (UCLH) on practical tools to improve patient flow through the hospital.
+## Summary
 
-With a team from UCLH, I developed a predictive tool that is now in daily use by bed managers at the hospital. The tool generates predictions of emergency demand for beds, using real-time data from the hospital's patient record system.
+patientflow, a Python package, converts patient-level predictions into output that is useful for bed managers in hospitals.
 
-I am sharing the code I wrote for UCLH as a reusable resource because I want to make it easier for researchers to convert patient-level predictions into output that is useful for bed managers in hospitals. This repository includes a Python package, called patientflow, which converts patient-level predictions into output that is useful for bed managers. If you have a predictive model of some outcome for a patient, like admission or discharge from hospital, you can use patientflow to create bed count distributions for a cohort of patients.
+We developed this code originally for University College London Hospitals (UCLH) NHS Trust to predict the number of emergency admissions within the next eight hours. The methods generalise to other aspects of patient flow in hospitals, including predictions of discharge numbers within a group of patients. It can be applied to any problem where it is useful to convert patient-level predictions into outcomes for a whole cohort of patients at a point in time.
 
-The methods generalise to any problem where it is useful to convert patient-level predictions into outcomes for a whole cohort of patients at a point in time. The repository includes a synthetic dataset and a series of notebooks demonstrating the use of the package.
+If you have a predictive model of some outcome for a patient, like admission or discharge from hospital, you can use patientflow to create bed count distributions for a cohort of patients. We show how to prepare your data and train models for these kinds of problems. The repository includes a synthetic dataset and a series of notebooks demonstrating the use of the package.
 
-## Main features of my modelling approach
+## What patientflow is for:
 
-- **Led by what users need:** My work is the result of close collaboration with operations directors and bed managers in the Coordination Centre, University College London Hospital (UCLH), since 2020. What is modelled directly reflects how they work and what is most useful to them.
-- **Focused on short-term predictions:** The modelling is designed for predicting demand within a short time horizon, e.g. 8 or 12 hours. I show how to use my code to predict how many beds will be needed for emergency patients. (Later I plan to add modules for elective demand, discharge and transfers between specialties.)
-- **Assumes real-time data is available:** Hospital bed managers have to deal with rapidly changing situations. My focus is on the use of real-time data (or near to real-time) to help them make informed decisions.
+- Managing patient flow in hospitals: The package can be used to predict numbers of emergency admissions, discharges or transfers between units.
+- Short-term operational planning: The predictions produced by this package are designed for bed managers who need to make decisions within a 4-16 hour timeframe.
+- Working with real-time data: The design assumes that data from an electronic health record (EHR) is available in real time, or near to real time.
+- Point-in-time analysis: The package works by taking "snapshots" of groups of patients at a particular moment, and making projections from those specific moments.
 
-## Main Features of this repository
+## What patientflow is NOT for:
 
-- **Reproducible:** I follow the principles of [Reproducible Analytical Pipelines](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/). The repository can be installed as a Python package, and imported into your own code.
-- **Accessible:** All the elements are based on simple techniques and methods in Health Data Science and Operational Research. I intend that anyone with some knowledge of Python could understand and adapt the code for their use.
-- **Practical:** I believe that it is easier to follow the steps I took if you have access to the same data I have. UCLH have released an anonymised version of real patient data, to which you can request access on [Zenodo](https://zenodo.org/records/14866057), or you can use the synthetic dataset, derived from real patient data, in the `data-synthetic` folder. (Note that, if you use the synthetic dataset, you will observe artificially inflated model performance.)
-- **Interactive:** The repository includes a set of notebooks with code written in Python and commentary. If you clone the repo into your own workspace and have an environment for running Jupyter notebooks, you will be able to interact with the code and see it running.
+- Long-term capacity planning: The package focuses on immediate operational needs (hours ahead), not strategic planning over weeks or months.
+- Making decisions about individual patients: The package is not designed for clinical decision-making about specific patients. It relies on data entered into the EHR by clinical staff looking after patients, but cannot and should not be used to influence their decision-making.
+- General hospital analytics: It is specifically focused on short-term bed management, not broader hospital analytics like long-term demand and capacity planning.
+- Finished/historical patient analysis: While historical data might be used to train the underlying models, the package itself focuses on patients currently in the hospital or soon to arrive.
+- Replacing human judgment: It augments the information available to bed managers, but isn't meant to automate bed management decisions completely.
+
+## This package will help you if you want to:
+
+- Convert individual patient predictions to cohort-level insights: Its core purpose is the creation of aggregate bed count distributions, because bed numbers are the currency used by bed managers.
+- Make predictions for unfinished patient visits: It is designed for making predictions when outcomes at the end of the visit are as yet unknown.
+- Develop your own predictive models of emergency demand: The package includes a fully worked example of how to convert data from A&E visits into the right structure, and use that data to train models that predict numbers of emergency beds.
+
+## This package will not help you if:
+
+- You work with time series data: patientflow works with snapshots of a hospital visit, summarising what is in the patient record up to that point in time.
+- Your focus is on predicting clinical outcomes: the approach is designed for operational prediction, not for clinical decision support.
+
+## Mathematical assumptions underlying the conversion from individual to cohort predictions:
+
+- Independence of patient outcomes: The package assumes that individual patient outcomes are conditionally independent given the features used in prediction.
+- Symbolic probability generation: The conversion uses symbolic mathematics (via SymPy) to construct a probability generating function that represents the exact distribution of possible cohort outcomes.
+- Bernoulli outcome model: Each patient outcome is modelled as a Bernoulli trial with its own probability, and the package computes the exact probability distribution for the sum of these independent trials.
+- Coefficient extraction approach: The method works by expanding a symbolic expression and extracting the coefficients corresponding to each possible cohort outcome count.
+- Optional weighted aggregation: When converting individual probabilities to cohort-level predictions, the package allows for weighted importance of individual predictions, modifying the contribution of each patient to the overall distribution in specific contexts (e.g. admissions to different specialties).
+- Discrete outcome space: The package assumes outcomes can be represented as discrete counts (e.g. number of admissions) rather than continuous values.
 
 ## Getting started
 
-- Exploration: Start with the [notebooks README](notebooks/README.md) to get an outline of the notebooks, and read the [patientflow README](src/patientflow/README.md) to understand my intentions for the Python package
+- Exploration: Start with the [notebooks README](notebooks/README.md) to get an outline of what is included in the notebooks, and read the [patientflow README](src/patientflow/README.md) for an overview of the Python package
 - Installation: Follow the instructions below to set up the environment and install necessary dependencies in your own environment
 - Configuration: Repurpose config.yaml to configure the package to your own data and user requirements
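The probability-generating-function conversion listed under "Mathematical assumptions" above can be sketched directly with SymPy. This is an illustrative sketch only; `bed_count_distribution` is a hypothetical helper, not part of the patientflow API:

```python
import sympy as sp


def bed_count_distribution(probs):
    """Exact distribution of the number of admissions among patients whose
    outcomes are independent Bernoulli trials, via a probability
    generating function (PGF)."""
    z = sp.Symbol("z")
    # The PGF of a sum of independent Bernoulli trials is the product of
    # one factor (1 - p + p*z) per patient.
    pgf = sp.expand(sp.prod([(1 - p) + p * z for p in probs]))
    poly = sp.Poly(pgf, z)
    # The coefficient of z**k is P(exactly k admissions).
    return [float(poly.coeff_monomial(z**k)) for k in range(len(probs) + 1)]


# Two patients with admission probabilities 0.5 and 0.25:
dist = bed_count_distribution([sp.Rational(1, 2), sp.Rational(1, 4)])
# dist == [0.375, 0.5, 0.125], i.e. P(0), P(1) and P(2) admissions
```

Weighted aggregation would modify each factor's probability before the expansion; the coefficient-extraction step is unchanged.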

notebooks/0_Background.ipynb

Whitespace-only changes.

notebooks/2_Specify_emergency_demand_model.ipynb renamed to notebooks/4_Specify_emergency_demand_model.ipynb

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "# Consider what makes a useful model for our users\n",
+   "# Specify an emergency demand model\n",
    "\n",
    "In [the first notebook](1_Meet_the_users_of_our_predictions.ipynb) I introduced bed managers and their work. Here I talk about what they need from predictions of emergency demand, and explain choices we made to make the model useful to them.\n",
    "\n",

notebooks/4a_Predict_probability_of_admission_from_ED.ipynb

Lines changed: 93 additions & 125 deletions
Large diffs are not rendered by default.

src/patientflow/train/classifiers.py

Lines changed: 1 addition & 1 deletion
@@ -256,7 +256,7 @@ def train_classifier(
     use_balanced_training: bool = True,
     majority_to_minority_ratio: float = 1.0,
     calibrate_probabilities: bool = True,
-    calibration_method: str = "isotonic",
+    calibration_method: str = "sigmoid",
 ) -> TrainedClassifier:
     """
     Train a single model including data preparation and balancing.
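The change of default from isotonic to sigmoid calibration can be illustrated with scikit-learn directly. This is a minimal sketch, not patientflow's `train_classifier`; "sigmoid" (Platt scaling) fits a logistic curve to the classifier's scores and, unlike "isotonic", cannot overfit a small calibration fold as easily:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy binary classification data standing in for patient snapshots
X, y = make_classification(n_samples=500, random_state=0)

# Cross-validated sigmoid calibration wrapped around the base classifier
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="sigmoid", cv=3
)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]  # calibrated probabilities in [0, 1]
```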

src/patientflow/viz/calibration_plot.py

Lines changed: 27 additions & 37 deletions
@@ -2,67 +2,56 @@
 from sklearn.calibration import calibration_curve
 from patientflow.predict.emergency_demand import add_missing_columns
 from patientflow.prepare import get_snapshots_at_prediction_time
-from patientflow.load import get_model_key, load_saved_model
+from patientflow.model_artifacts import TrainedClassifier
 
 # Define the color scheme
 primary_color = "#1f77b4"
 secondary_color = "#aec7e8"
 
 
 def plot_calibration(
-    prediction_times,
+    trained_models: list[TrainedClassifier],
     media_file_path,
-    trained_models,
     test_visits,
     exclude_from_training_data,
     strategy="uniform",
-    model_group_name="admssions",
-    model_name_suffix=None,
     suptitle=None,
-    model_file_path=None,
 ):
-    # Load models if not provided
-    if trained_models is None:
-        if model_file_path is None:
-            raise ValueError(
-                "model_file_path must be provided if trained_models is None"
-            )
-        trained_models = {}
-        for prediction_time in prediction_times:
-            model_name = get_model_key(model_group_name, prediction_time)
-            if model_name_suffix:
-                model_name = f"{model_name}_{model_name_suffix}"
-            trained_models[model_name] = load_saved_model(
-                model_file_path, model_group_name, prediction_time
-            )
-
-    # Sort prediction times by converting to minutes since midnight
-    prediction_times_sorted = sorted(
-        prediction_times,
-        key=lambda x: x[0] * 60
-        + x[1],  # Convert (hour, minute) to minutes since midnight
+    """
+    Plot calibration curves for multiple models.
+
+    Args:
+        trained_models: List of TrainedClassifier objects
+        media_file_path: Path where the plot should be saved
+        test_visits: DataFrame containing test visit data
+        exclude_from_training_data: Columns to exclude from the test data
+        strategy: Strategy for calibration curve binning ('uniform' or 'quantile')
+        suptitle: Optional super title for the entire figure
+    """
+    # Sort trained_models by prediction time
+    trained_models_sorted = sorted(
+        trained_models,
+        key=lambda x: x.training_results.prediction_time[0] * 60
+        + x.training_results.prediction_time[1],
     )
-    num_plots = len(prediction_times_sorted)
+    num_plots = len(trained_models_sorted)
     fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 5, 4))
 
     # Handle case of single prediction time
     if num_plots == 1:
         axs = [axs]
 
-    for i, prediction_time in enumerate(prediction_times_sorted):
-        # Get model name and pipeline for this prediction time
-        model_name = get_model_key(model_group_name, prediction_time)
-        if model_name_suffix:
-            model_name = f"{model_name}_{model_name_suffix}"
-
+    for i, trained_model in enumerate(trained_models_sorted):
         # Use calibrated pipeline if available, otherwise use regular pipeline
         if (
-            hasattr(trained_models[model_name], "calibrated_pipeline")
-            and trained_models[model_name].calibrated_pipeline is not None
+            hasattr(trained_model, "calibrated_pipeline")
+            and trained_model.calibrated_pipeline is not None
         ):
-            pipeline = trained_models[model_name].calibrated_pipeline
+            pipeline = trained_model.calibrated_pipeline
         else:
-            pipeline = trained_models[model_name].pipeline
+            pipeline = trained_model.pipeline
+
+        prediction_time = trained_model.training_results.prediction_time
 
         # Get test data for this prediction time
         X_test, y_test = get_snapshots_at_prediction_time(
@@ -112,3 +101,4 @@ def plot_calibration(
 
     plt.savefig(calib_plot_path)
     plt.show()
+    plt.close()
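The sort key in the rewritten plotting functions orders models by their (hour, minute) prediction time, converted to minutes since midnight. A minimal illustration of the same key on bare tuples:

```python
# (hour, minute) prediction times, as used by the plotting functions
times = [(15, 30), (6, 0), (9, 45), (12, 0)]

# Same key as in the diff above: hours * 60 + minutes
ordered = sorted(times, key=lambda t: t[0] * 60 + t[1])
# ordered == [(6, 0), (9, 45), (12, 0), (15, 30)]
```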

src/patientflow/viz/distribution_plots.py

Lines changed: 31 additions & 37 deletions
@@ -1,65 +1,59 @@
 import matplotlib.pyplot as plt
 from patientflow.predict.emergency_demand import add_missing_columns
 from patientflow.prepare import get_snapshots_at_prediction_time
-from patientflow.load import get_model_key, load_saved_model
+from patientflow.model_artifacts import TrainedClassifier
+from typing import Optional
+from pathlib import Path
 
 # Define the color scheme
 primary_color = "#1f77b4"
 secondary_color = "#ff7f0e"
 
 
 def plot_prediction_distributions(
-    prediction_times,
-    media_file_path,
-    trained_models,
+    trained_models: list[TrainedClassifier],
     test_visits,
     exclude_from_training_data,
-    model_group_name="admissions",
-    model_name_suffix=None,
     bins=30,
-    model_file_path=None,
+    media_file_path: Optional[Path] = None,
 ):
-    # Load models if not provided
-    if trained_models is None:
-        if model_file_path is None:
-            raise ValueError(
-                "model_file_path must be provided if trained_models is None"
-            )
-        trained_models = {}
-        for prediction_time in prediction_times:
-            model_name = get_model_key(model_group_name, prediction_time)
-            if model_name_suffix:
-                model_name = f"{model_name}_{model_name_suffix}"
-            trained_models[model_name] = load_saved_model(
-                model_file_path, model_group_name, prediction_time
-            )
-
-    # Sort prediction times by converting to minutes since midnight
-    prediction_times_sorted = sorted(
-        prediction_times,
-        key=lambda x: x[0] * 60 + x[1],
+    """
+    Plot prediction distributions for multiple models.
+
+    Args:
+        trained_models: List of TrainedClassifier objects
+        test_visits: DataFrame containing test visit data
+        exclude_from_training_data: Columns to exclude from the test data
+        bins: Number of bins for the histogram (default: 30)
+        media_file_path: Path to save the plot (default: None)
+    """
+    if media_file_path is None:
+        raise ValueError("media_file_path must be provided")
+
+    # Sort trained_models by prediction time
+    trained_models_sorted = sorted(
+        trained_models,
+        key=lambda x: x.training_results.prediction_time[0] * 60
+        + x.training_results.prediction_time[1],
     )
-    num_plots = len(prediction_times_sorted)
+    num_plots = len(trained_models_sorted)
     fig, axs = plt.subplots(1, num_plots, figsize=(num_plots * 5, 4))
 
     # Handle case of single prediction time
     if num_plots == 1:
         axs = [axs]
 
-    for i, prediction_time in enumerate(prediction_times_sorted):
-        # Get model name and pipeline for this prediction time
-        model_name = get_model_key(model_group_name, prediction_time)
-        if model_name_suffix:
-            model_name = f"{model_name}_{model_name_suffix}"
-
+    for i, trained_model in enumerate(trained_models_sorted):
         # Use calibrated pipeline if available, otherwise use regular pipeline
         if (
-            hasattr(trained_models[model_name], "calibrated_pipeline")
-            and trained_models[model_name].calibrated_pipeline is not None
+            hasattr(trained_model, "calibrated_pipeline")
+            and trained_model.calibrated_pipeline is not None
         ):
-            pipeline = trained_models[model_name].calibrated_pipeline
+            pipeline = trained_model.calibrated_pipeline
         else:
-            pipeline = trained_models[model_name].pipeline
+            pipeline = trained_model.pipeline
+
+        prediction_time = trained_model.training_results.prediction_time
 
         # Get test data for this prediction time
         X_test, y_test = get_snapshots_at_prediction_time(
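Both plotting functions now prefer a calibrated pipeline when one exists and fall back to the raw pipeline otherwise. The pattern, sketched with a simplified stand-in for `TrainedClassifier` (the real class lives in `patientflow.model_artifacts`):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainedClassifier:  # simplified stand-in, not the patientflow class
    pipeline: str
    calibrated_pipeline: Optional[str] = None


def choose_pipeline(model: TrainedClassifier) -> str:
    # Prefer the calibrated pipeline when one was fitted
    if getattr(model, "calibrated_pipeline", None) is not None:
        return model.calibrated_pipeline
    return model.pipeline


print(choose_pipeline(TrainedClassifier("raw", "calibrated")))  # calibrated
print(choose_pipeline(TrainedClassifier("raw")))                # raw
```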
